Security data case study

Malicious URL Classification

This repo benchmarks how well lexical URL features alone can separate malicious from benign URLs in a classification setting. The public repo includes cleaning steps, sampled training workflows, saved outputs, and run metadata. It does not use DNS, WHOIS, page content, or temporal signals.

  • 651,191 rows loaded from the original malicious URL CSV snapshot.
  • 640,792 rows kept after deterministic cleaning and conflict removal.
  • 80,000-row stratified modeling sample used to keep k-NN evaluation practical.
  • 3-fold CV for model selection, with saved test evaluation and random seed 42.

Current scope

Question
Estimate how well lexical URL features alone can distinguish benign and malicious traffic labels.
Built
Cleaning, feature engineering, binary and multiclass training workflows, CSV result exports, and reproducibility metadata for a statistics and data mining project.
Artifacts
Saved test metrics, class distributions, confusion matrices, feature-set comparison outputs, and a run metadata file with hash and sample details.
Completed
Lexical-only modeling, class-by-class evaluation, and a documented explanation of why the k-NN workflow uses an 80,000-row sample.
Not included
DNS, content, WHOIS, and temporal features are outside the current benchmark.

Modeling notes

Best saved model

k-NN produced the strongest saved test metrics, while logistic regression remained easier to inspect and explain.

Sampling tradeoff

The 80,000-row sample keeps repeated k-NN cross-validation practical for the course scope.
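A minimal sketch of the sampled selection step, using synthetic placeholder data so it runs standalone; the actual feature matrix, hyperparameter grid, and pipeline details live in the repo's training scripts and are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the engineered lexical features.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.67], random_state=42)

# Stratified subsample keeps class proportions, mirroring the repo's
# 80,000-of-640,792 sampling step (here 50% for illustration).
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.5,
                                      stratify=y, random_state=42)

# 3-fold CV over k, scored on macro F1; the real grid is an assumption here.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(knn,
                    {"kneighborsclassifier__n_neighbors": [3, 5, 7]},
                    cv=3, scoring="f1_macro")
grid.fit(X_sub, y_sub)
print(grid.best_params_)
```

Subsampling before the grid search is what keeps repeated k-NN fits tractable, since each k-NN prediction scales with the number of stored training rows.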

Feature boundary

The models use lexical features only, so the results are narrower than a production phishing or malware detection system.

Workflow artifact

This diagram summarizes the current repo pipeline using the numbers from the saved run_metadata.json and result files.

  • Raw CSV: 651,191 rows with url and type columns
  • Cleaning: drop malformed rows and duplicates, resolve label conflicts; 640,792 rows kept
  • Features: length counts, host/path stats, suspicious tokens
  • Models: majority baseline, logistic regression, k-NN on an 80,000-row sample with 3-fold CV
  • Outputs: CSV metrics, confusion matrices, run metadata
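The feature step can be sketched as a small extraction function. The token list and feature names below are hypothetical illustrations; the repo's actual feature set differs in detail:

```python
from urllib.parse import urlparse

# Hypothetical token list; the repo's suspicious-token set is not shown here.
SUSPICIOUS_TOKENS = ("login", "verify", "update", "secure", "account")

def lexical_features(url: str) -> dict:
    """Toy lexical features of the kind the pipeline uses: lengths,
    character counts, and suspicious-token flags, no network lookups."""
    parsed = urlparse(url if "//" in url else "//" + url)
    host, path = parsed.netloc, parsed.path
    return {
        "url_len": len(url),
        "host_len": len(host),
        "path_len": len(path),
        "digit_count": sum(c.isdigit() for c in url),
        "dot_count": url.count("."),
        "has_suspicious_token": int(any(t in url.lower()
                                        for t in SUSPICIOUS_TOKENS)),
    }

print(lexical_features("http://example.com/login?id=123"))
```

Everything here is computable from the URL string itself, which is exactly the feature boundary the benchmark commits to.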

The repo records random_state = 42, test_size = 0.2, and "extended" as the selected feature set for the saved binary workflow.
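The recorded split settings correspond to a stratified hold-out split, sketched here on placeholder arrays rather than the real feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the engineered lexical feature matrix;
# the 0/1 mix roughly echoes the benign share in this dataset.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 670 + [1] * 330)

# Matches the recorded run settings: random_state=42, test_size=0.2,
# stratified so both splits keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 800 200
```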

Class distribution artifact

The cleaned dataset remains highly imbalanced. This is why balanced accuracy and macro F1 matter more here than headline accuracy alone.

  • benign: 66.78%
  • defacement: 14.87%
  • phishing: 14.66%
  • malware: 3.69%

Counts from the saved cleaned_class_distribution.csv file: benign 427,931, defacement 95,285, phishing 93,931, malware 23,645.
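The saved percentages follow directly from those counts, which also sum to the cleaned row total:

```python
# Counts from cleaned_class_distribution.csv.
counts = {"benign": 427_931, "defacement": 95_285,
          "phishing": 93_931, "malware": 23_645}
total = sum(counts.values())
assert total == 640_792  # matches the cleaned row count

# Class shares as percentages, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
print(shares)
# {'benign': 66.78, 'defacement': 14.87, 'phishing': 14.66, 'malware': 3.69}
```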

Held-out binary test metrics

Model                  Accuracy   Balanced acc.   Macro F1
k-NN                   0.9444     0.9351          0.9371
Logistic regression    0.8616     0.8425          0.8436
Majority baseline      0.6678     0.5000          0.4004

The saved binary metrics also report k-NN precision 0.9241, recall 0.9072, and F1 0.9156 on the benign-vs-malicious task.
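The majority-baseline row can be checked by hand: predicting benign for every URL makes accuracy equal the benign share, balanced accuracy 0.5 (recall 1.0 on benign, 0.0 on malicious), and macro F1 the average of the benign F1 with zero. A sketch on synthetic labels matching the 66.78% benign share:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score)

# Synthetic labels at the dataset's benign share (0 = benign, 1 = malicious).
y_true = [0] * 6678 + [1] * 3322
y_pred = [0] * len(y_true)  # majority baseline: always predict benign

print(round(accuracy_score(y_true, y_pred), 4))           # 0.6678
print(round(balanced_accuracy_score(y_true, y_pred), 4))  # 0.5
print(round(f1_score(y_true, y_pred,
                     average="macro", zero_division=0), 4))  # 0.4004
```

The same arithmetic explains the multiclass baseline row: balanced accuracy 1/4 = 0.25 and macro F1 0.8008 / 4 = 0.2002.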

Held-out multiclass test metrics

Model                  Accuracy   Balanced acc.   Macro F1
k-NN                   0.9270     0.8795          0.8911
Logistic regression    0.7644     0.7958          0.7251
Majority baseline      0.6678     0.2500          0.2002

Per-class metrics show phishing as the hardest multiclass category for both major models. For k-NN, phishing recall is 0.7330, lower than its recall on benign, defacement, and malware.

Result highlights

  • k-NN is the strongest predictive model in the saved outputs for both binary and multiclass tasks.
  • Logistic regression gives up some performance, but it is easier to interpret and reason about.
  • Phishing remains the hardest multiclass category in the saved evaluation outputs.
  • The selected extended feature set only slightly outperformed the core set in cross-validation.

Current limitations

  • The workflow is lexical-only. It does not inspect page content, DNS, WHOIS, or temporal behavior.
  • The 80,000-row sample is a practical compute tradeoff, not a full-dataset benchmark.
  • These metrics are meaningful for this dataset, but they are not evidence of production readiness.
  • The project is a classification benchmark, not a deployed detection system.