Starting My ML Journey: Detecting Malicious URLs with Machine Learning
Introduction
Machine learning has endless possibilities, and I decided to kick off my journey by building a simple model to classify URLs as either malicious or benign. Using publicly available datasets, I created and trained a model that extracts meaningful features from URLs and predicts whether they are safe or potentially harmful.
This post outlines the entire process, from data preparation to model evaluation.
Data Preparation
I sourced the data from two places:
- Majestic Million: A ranking of the top one million domains on the web, used here as the benign class.
- URLhaus: A collection of known malicious URLs.
After downloading these datasets, I cleaned them to retain only the URLs and assigned labels: 1 for malicious and 0 for benign.
Loading and Merging Data
import pandas as pd

# Known-malicious URLs from URLhaus (label 1)
malicious_df = pd.read_csv('data/urls/urlhaus_malicious_urls.csv')
malicious_df['label'] = 1

# Top domains from the Majestic Million (label 0)
benign_df = pd.read_csv('data/urls/majestic_million_domain_only.csv')
benign_df['label'] = 0

# Stack both sets into one labeled dataframe
df = pd.concat([malicious_df, benign_df], ignore_index=True)
To ensure balanced training, I sampled an equal number of URLs from each class:
# Downsample so each class contributes the same number of rows
min_class_size = min(df['label'].value_counts())
balanced_df = df.groupby('label').sample(n=min_class_size, random_state=42)
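As a quick sanity check (my own addition, not part of the original pipeline), both labels should now report the same count:

# Each label should appear exactly min_class_size times
print(balanced_df['label'].value_counts())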
Feature Extraction
Next, I extracted features from the URLs that could help the model make predictions. The key features included:
- URL length
- Number of digits
- Number of query parameters
- Use of IP address
- Number of subdomains
- Top-level domain (TLD)
- Path length
- Whether a port number is present
Extracting Features
from urllib.parse import urlparse
import re
def extract_features(url):
    parsed = urlparse(url)
    netloc = parsed.netloc

    # Record the port before stripping it; checking after the split
    # would make has_port always 0
    has_port = 1 if ':' in netloc else 0
    if ':' in netloc:
        netloc = netloc.split(':')[0]

    # Anchor the pattern so domains that merely start with digits don't match
    is_ip = re.fullmatch(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', netloc)

    features = {
        'url_length': len(url),
        'num_digits': sum(c.isdigit() for c in url),
        'num_parameters': len(parsed.query.split('&')) if parsed.query else 0,
        'uses_ip': 1 if is_ip else 0,
        'num_subdomains': 0,
        'tld': 'ip' if is_ip else None,
        'path_length': len(parsed.path),
        'has_port': has_port
    }

    if not is_ip:
        parts = netloc.split('.')
        if len(parts) >= 2:
            features['tld'] = parts[-1]
            features['num_subdomains'] = len(parts) - 2
    return features
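The post doesn't show how these feature dicts become a dataframe, so here is a minimal bridging sketch under two assumptions: the URL column is named url, and bare domains get a scheme prepended (the Majestic Million entries have no http:// prefix, and without one urlparse leaves netloc empty):

# Hypothetical bridging step: build the feature table used below.
# Assumes balanced_df has a column named 'url'.
def normalize(url):
    # Bare domains need a scheme, or urlparse() puts everything in .path
    return url if '://' in url else 'http://' + url

final_df = pd.DataFrame(
    balanced_df['url'].map(normalize).map(extract_features).tolist()
)
final_df['label'] = balanced_df['label'].values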
Rare TLDs were grouped as "other," and frequencies of each TLD were calculated:
tld_counts = final_df['tld'].value_counts()
rare_tlds = tld_counts[tld_counts < 10].index
final_df['tld'] = final_df['tld'].replace(rare_tlds, 'other')

# Frequency-encode the TLD so the model gets a numeric column
tld_frequencies = final_df['tld'].value_counts(normalize=True)
final_df['tld_freq'] = final_df['tld'].map(tld_frequencies).fillna(0)
Model Training
I chose the XGBoost classifier for its efficiency and strong performance with structured data.
Training the Model
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
# Drop the label and the raw string TLD; XGBoost needs numeric inputs
X = final_df.drop(columns=['label', 'tld'])
y = final_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.1, eval_metric='logloss')
model.fit(X_train, y_train)
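This next step isn't in the original post, but XGBoost's built-in feature importances are a quick way to sanity-check what the model latched onto:

# Optional: see which features the trees rely on most
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))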
Model Evaluation
I evaluated the model using metrics such as accuracy, confusion matrix, and classification report:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Generate predictions for the held-out test set
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion matrix:\n{conf_matrix}")
print(f"Classification report:\n{report}")
Results
The model achieved an accuracy close to 100% on the test set. That figure is probably flattering, though: the benign set consists of bare domains while the malicious set contains full URLs, so features like URL length and path length alone can separate the classes almost perfectly. Evaluating on URLs collected the same way for both classes would give a more honest picture.
Lessons Learned and Future Plans
This project taught me valuable lessons about:
- Feature engineering from URLs.
- Handling imbalanced datasets.
- Using tree-based models for classification.
In the future, I plan to:
- Incorporate additional features, such as WHOIS information.
- Test the model on unseen datasets.
- Deploy the model as a web service (a rough single-URL scoring sketch follows below).
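As a small step toward that last item, here is a hedged sketch (my own addition, reusing extract_features, tld_frequencies, X, and model from above) of scoring a single URL:

# Hypothetical helper: score one URL with the trained pipeline
def predict_url(url):
    feats = extract_features(url if '://' in url else 'http://' + url)
    feats['tld_freq'] = tld_frequencies.get(feats['tld'], 0)
    # Align columns with the training frame before predicting
    row = pd.DataFrame([feats]).drop(columns=['tld'])[X.columns]
    return model.predict_proba(row)[0, 1]  # probability of "malicious"

print(predict_url('http://203.0.113.7/login.php?user=admin'))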
Thank you for reading about my first steps in the world of machine learning!