# Malicious URL Classification

A security data case study.
This repo benchmarks how well lexical URL features alone can separate malicious from benign URLs. The public repo includes cleaning steps, sampled training workflows, saved outputs, and run metadata. It does not use DNS, WHOIS, page content, or temporal signals.
## Current scope

- **Question:** Estimate how well lexical URL features alone can distinguish benign and malicious traffic labels.
- **Built:** Cleaning, feature engineering, binary and multiclass training workflows, CSV result exports, and reproducibility metadata for a statistics and data mining project.
- **Artifacts:** Saved test metrics, class distributions, confusion matrices, feature-set comparison outputs, and a run metadata file with hash and sample details.
- **Completed:** Lexical-only modeling, class-by-class evaluation, and a documented explanation of why the k-NN workflow uses an 80,000-row sample.
- **Not included:** DNS, content, WHOIS, and temporal features are outside the current benchmark.
## Modeling notes
k-NN produced the strongest saved test metrics, while logistic regression remained easier to inspect and explain.
The 80,000-row sample keeps repeated k-NN cross-validation practical for the course scope.
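The sampling step can be sketched in pure Python. The function and the toy data below are illustrative stand-ins, not the repo's actual code:

```python
import random
from collections import Counter, defaultdict

# Illustrative only: draw a class-stratified subsample so that repeated
# k-NN cross-validation stays tractable. The URLs and labels here are
# synthetic; the repo's real sampling code may differ.
def stratified_sample(items, labels, n_rows, seed=42):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    frac = n_rows / len(items)
    sample = []
    for label, members in by_class.items():
        k = round(len(members) * frac)  # keep each class's share of rows
        sample.extend((item, label) for item in rng.sample(members, k))
    return sample

urls = [f"http://example{i}.com" for i in range(1000)]
labels = ["benign"] * 800 + ["phishing"] * 200
subsample = stratified_sample(urls, labels, 100)
counts = Counter(label for _, label in subsample)
print(counts["benign"], counts["phishing"])  # 80 20
```

Stratification matters here because a plain random sample of an imbalanced dataset could under-represent the rare malware class.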
The models use lexical features only, so the results are narrower than a production phishing or malware detection system.
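A lexical-only extractor of the kind described above might look like the following sketch. The specific feature names are assumptions for illustration, not the repo's documented feature sets:

```python
from urllib.parse import urlparse

# Illustrative lexical features computed from the URL string alone --
# no DNS, WHOIS, or page-content lookups. Feature names are assumptions.
def lexical_features(url: str) -> dict:
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_depth": parsed.path.count("/"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special": sum(ch in "-_@?&=%" for ch in url),
        "has_ip_host": host.replace(".", "").isdigit(),
        "num_subdomains": max(host.count(".") - 1, 0),
    }

feats = lexical_features("http://192.168.0.1/login.php?user=admin")
print(feats["has_ip_host"], feats["num_digits"])  # True 8
```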
## Workflow artifact
This diagram summarizes the current repo pipeline using the numbers from the saved `run_metadata.json` and result files.
The repo records `random_state = 42`, `test_size = 0.2`, and the selected feature set as `extended` for the saved binary workflow.
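Under those recorded settings, the split stage corresponds to something like this pure-Python sketch. The repo itself presumably uses a library splitter such as scikit-learn's `train_test_split`; this stand-in just mirrors `test_size = 0.2`, seeded with 42, stratified by label:

```python
import random

# Sketch of the recorded split settings: 20% held-out test set, seed 42,
# stratified by label. The data below is synthetic; this is a stand-in,
# not the repo's actual splitting code.
def stratified_split(items, labels, test_size=0.2, seed=42):
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    test_idx = set()
    for idx_list in by_class.values():
        k = round(len(idx_list) * test_size)  # per-class test share
        test_idx.update(rng.sample(idx_list, k))
    train = [items[i] for i in range(len(items)) if i not in test_idx]
    test = [items[i] for i in range(len(items)) if i in test_idx]
    return train, test

urls = [f"u{i}" for i in range(100)]
labels = ["malicious" if i % 5 == 0 else "benign" for i in range(100)]
train, test = stratified_split(urls, labels)
print(len(train), len(test))  # 80 20
```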
## Class distribution artifact
The cleaned dataset remains highly imbalanced. This is why balanced accuracy and macro F1 matter more here than headline accuracy alone.
Counts from the saved `cleaned_class_distribution.csv` file: benign 427,931; defacement 95,285; phishing 93,931; malware 23,645.
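Those counts make the majority baseline's headline accuracy easy to reproduce:

```python
# Class counts from the saved cleaned_class_distribution.csv.
counts = {"benign": 427_931, "defacement": 95_285, "phishing": 93_931, "malware": 23_645}
total = sum(counts.values())

# A classifier that always predicts "benign" already scores ~0.67 accuracy,
# which is why balanced accuracy and macro F1 are reported instead.
majority_accuracy = counts["benign"] / total
print(f"{majority_accuracy:.4f}")  # 0.6678
```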
## Held-out binary test metrics
| Model | Accuracy | Balanced acc. | Macro F1 |
|---|---|---|---|
| k-NN | 0.9444 | 0.9351 | 0.9371 |
| Logistic regression | 0.8616 | 0.8425 | 0.8436 |
| Majority baseline | 0.6678 | 0.5000 | 0.4004 |
The saved binary metrics also report k-NN precision 0.9241, recall 0.9072, and F1 0.9156 on the benign-vs-malicious task.
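The majority-baseline row of the table follows directly from the benign prevalence; a quick arithmetic check:

```python
# Check the majority-baseline row of the binary table from first principles.
# With benign prevalence p, a constant "benign" predictor gets:
#   accuracy = p, balanced accuracy = (1 + 0) / 2, macro F1 = (F1_benign + 0) / 2
p = 427_931 / 640_792            # benign share of the cleaned dataset
f1_benign = 2 * p * 1 / (p + 1)  # precision = p, recall = 1
print(round(p, 4), round((1 + 0) / 2, 4), round(f1_benign / 2, 4))  # 0.6678 0.5 0.4004
```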
## Held-out multiclass test metrics
| Model | Accuracy | Balanced acc. | Macro F1 |
|---|---|---|---|
| k-NN | 0.9270 | 0.8795 | 0.8911 |
| Logistic regression | 0.7644 | 0.7958 | 0.7251 |
| Majority baseline | 0.6678 | 0.2500 | 0.2002 |
Per-class metrics show phishing as the hardest multiclass category for both major models. For k-NN, phishing recall is 0.7330, lower than the recall for benign, defacement, and malware.
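Per-class recall comes straight from the rows of a confusion matrix. A sketch with a synthetic matrix (the repo's saved confusion matrices hold the real counts):

```python
# Per-class recall from a confusion matrix: recall_c = TP_c / row_sum_c.
# The matrix below is synthetic, purely to illustrate the computation.
classes = ["benign", "defacement", "malware", "phishing"]
matrix = [
    [950, 10, 5, 35],   # true benign
    [12, 180, 3, 5],    # true defacement
    [4, 2, 40, 4],      # true malware
    [40, 6, 4, 150],    # true phishing
]
for i, (name, row) in enumerate(zip(classes, matrix)):
    print(f"{name}: recall = {row[i] / sum(row):.3f}")
```

In this toy matrix, phishing's recall (150/200 = 0.750) is the lowest, mirroring the pattern in the saved evaluation outputs.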
## Result highlights
- k-NN is the strongest predictive model in the saved outputs for both binary and multiclass tasks.
- Logistic regression gives up some performance, but it is easier to interpret and reason about.
- Phishing remains the hardest multiclass category in the saved evaluation outputs.
- The selected `extended` feature set only slightly outperformed the `core` set in cross-validation.
## Current limitations
- The workflow is lexical-only. It does not inspect page content, DNS, WHOIS, or temporal behavior.
- The 80,000-row sample is a practical compute tradeoff, not a full-dataset benchmark.
- These metrics are meaningful for this dataset, but they are not evidence of production readiness.
- The project is a classification benchmark, not a deployed detection system.