Starting My ML Journey: Detecting Malicious URLs with Machine Learning
Introduction
Machine learning has endless possibilities, and I decided to kick off my journey by building a simple model to classify URLs as either malicious or benign. Using publicly available datasets, I created and trained a model that extracts meaningful features from URLs and predicts whether they are safe or potentially harmful.
This post outlines the entire process, from data preparation to model evaluation.
Data Preparation
I sourced the data from two places:
- Majestic Million: A ranking of the top one million domains on the web, used here as the benign class.
- URLhaus: A collection of known malicious URLs.
After downloading these datasets, I cleaned them to retain only the URLs and assigned labels: 1 for malicious and 0 for benign.
Loading and Merging Data
import pandas as pd

# Known-malicious URLs from URLhaus (label 1)
malicious_df = pd.read_csv('data/urls/urlhaus_malicious_urls.csv')
malicious_df['label'] = 1

# Top domains from the Majestic Million (label 0)
benign_df = pd.read_csv('data/urls/majestic_million_domain_only.csv')
benign_df['label'] = 0

# Stack both sets into one labeled dataframe
df = pd.concat([malicious_df, benign_df], ignore_index=True)
To ensure balanced training, I sampled an equal number of URLs from each class:
# Downsample so each class contributes the same number of rows
min_class_size = min(df['label'].value_counts())
balanced_df = df.groupby('label').sample(n=min_class_size, random_state=42)
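As a quick sanity check (my own addition, not part of the original pipeline), both labels should now report the same count:

# Each label should appear exactly min_class_size times
print(balanced_df['label'].value_counts())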
Feature Extraction
Next, I extracted features from the URLs that could help the model make predictions. The key features included:
- URL length
- Number of digits
- Number of query parameters
- Use of IP address
- Number of subdomains
- Top-level domain (TLD)
- Path length
- Whether a port number is present
Extracting Features
from urllib.parse import urlparse
import re
def extract_features(url):
    parsed = urlparse(url)
    netloc = parsed.netloc

    # Record the port before stripping it; checking after the split
    # would make has_port always 0
    has_port = 1 if ':' in netloc else 0
    if ':' in netloc:
        netloc = netloc.split(':')[0]

    # Anchor the pattern so domains that merely start with digits don't match
    is_ip = re.fullmatch(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', netloc)

    features = {
        'url_length': len(url),
        'num_digits': sum(c.isdigit() for c in url),
        'num_parameters': len(parsed.query.split('&')) if parsed.query else 0,
        'uses_ip': 1 if is_ip else 0,
        'num_subdomains': 0,
        'tld': 'ip' if is_ip else None,
        'path_length': len(parsed.path),
        'has_port': has_port
    }

    if not is_ip:
        parts = netloc.split('.')
        if len(parts) >= 2:
            features['tld'] = parts[-1]
            features['num_subdomains'] = len(parts) - 2
    return features
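The post doesn't show how these feature dicts become a dataframe, so here is a minimal bridging sketch under two assumptions: the URL column is named url, and bare domains get a scheme prepended (the Majestic Million entries have no http:// prefix, and without one urlparse leaves netloc empty):

# Hypothetical bridging step: build the feature table used below.
# Assumes balanced_df has a column named 'url'.
def normalize(url):
    # Bare domains need a scheme, or urlparse() puts everything in .path
    return url if '://' in url else 'http://' + url

final_df = pd.DataFrame(
    balanced_df['url'].map(normalize).map(extract_features).tolist()
)
final_df['label'] = balanced_df['label'].values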
Rare TLDs were grouped as "other," and frequencies of each TLD were calculated:
tld_counts = final_df['tld'].value_counts()
rare_tlds = tld_counts[tld_counts < 10].index
final_df['tld'] = final_df['tld'].replace(rare_tlds, 'other')

# Frequency-encode the TLD so the model gets a numeric column
tld_frequencies = final_df['tld'].value_counts(normalize=True)
final_df['tld_freq'] = final_df['tld'].map(tld_frequencies).fillna(0)
Model Training
I chose the XGBoost classifier for its efficiency and strong performance with structured data.
Training the Model
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
# Drop the label and the raw string TLD; XGBoost needs numeric inputs
X = final_df.drop(columns=['label', 'tld'])
y = final_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.1, eval_metric='logloss')
model.fit(X_train, y_train)
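This next step isn't in the original post, but XGBoost's built-in feature importances are a quick way to sanity-check what the model latched onto:

# Optional: see which features the trees rely on most
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))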
Model Evaluation
I evaluated the model using metrics such as accuracy, confusion matrix, and classification report:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Generate predictions for the held-out test set
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion matrix:\n{conf_matrix}")
print(f"Classification report:\n{report}")
Results
The model achieved an accuracy close to 100% on the test set. That figure is probably flattering, though: the benign set consists of bare domains while the malicious set contains full URLs, so features like URL length and path length alone can separate the classes almost perfectly. Evaluating on URLs collected the same way for both classes would give a more honest picture.
Lessons Learned and Future Plans
This project taught me valuable lessons about:
- Feature engineering from URLs.
- Handling imbalanced datasets.
- Using tree-based models for classification.
In the future, I plan to:
- Incorporate additional features, such as WHOIS information.
- Test the model on unseen datasets.
- Deploy the model as a web service (a rough single-URL scoring sketch follows below).
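As a small step toward that last item, here is a hedged sketch (my own addition, reusing extract_features, tld_frequencies, X, and model from above) of scoring a single URL:

# Hypothetical helper: score one URL with the trained pipeline
def predict_url(url):
    feats = extract_features(url if '://' in url else 'http://' + url)
    feats['tld_freq'] = tld_frequencies.get(feats['tld'], 0)
    # Align columns with the training frame before predicting
    row = pd.DataFrame([feats]).drop(columns=['tld'])[X.columns]
    return model.predict_proba(row)[0, 1]  # probability of "malicious"

print(predict_url('http://203.0.113.7/login.php?user=admin'))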
Thank you for reading about my first steps in the world of machine learning!