Project repository for DA 204o Data Science in Practice (Aug semester 2024) @ IISc BLR
The purpose of this project is to develop a reliable machine learning-based system for detecting phishing URLs by analyzing their structure, content, and behavior.
Source: PhiUSIIL Phishing URL (Website)
Summary: The PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs analyzed while constructing the dataset are recent. Features are extracted from the URL and from the source code of the webpage. Some features, such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb, are derived from existing features.
Additional Info:
- Column "FILENAME" can be ignored.
- Label 1 corresponds to a legitimate URL, label 0 to a phishing URL.
Dataset Overview
- Basic exploration: info(), describe(), shape, and isnull().
- Dataset contains 235,795 rows, no missing or duplicate values.
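The exploration calls listed above can be sketched as follows. This is a minimal illustration on a three-row toy frame rather than the real CSV; the column names `URLLength` and `label` are assumptions made for the example, and only `FILENAME` is named in this README.

```python
import pandas as pd

# Toy stand-in for the real dataset. Column names other than FILENAME
# are assumptions for illustration; the values are made up.
df = pd.DataFrame(
    {
        "FILENAME": ["a.txt", "b.txt", "c.txt"],
        "URLLength": [23, 87, 41],
        "label": [1, 0, 1],  # 1 = legitimate, 0 = phishing
    }
)

# FILENAME can be ignored, so drop it before any analysis.
df = df.drop(columns=["FILENAME"])

n_rows, n_cols = df.shape
missing_total = int(df.isnull().sum().sum())  # total missing cells
duplicate_rows = int(df.duplicated().sum())   # fully duplicated rows

df.info()
print(df.describe())
print(n_rows, n_cols, missing_total, duplicate_rows)
```

On the real data, the same checks are what back the "no missing or duplicate values" statement; see 01-EDA.ipynb for the actual exploration.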
Key Groupings of Features
- URL Characteristics: Length, special characters, obfuscation metrics.
- Legitimacy Indicators: HTTPS usage, TLD legitimacy, subdomains.
- Web Page Content: Title, favicon, and descriptions.
- Web Page Features: Redirects, popups, and social network links.
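To make the first two feature groups concrete, here is a small sketch that derives a few URL-level signals using only the standard library. The function name `url_features`, the chosen character set, and the output keys are all hypothetical; they are not the dataset's own feature definitions, which come precomputed.

```python
from urllib.parse import urlparse

# Hypothetical set of characters to count; the dataset defines its own
# special-character and obfuscation features.
SPECIAL_CHARS = set("@-_%=&?~")


def url_features(url: str) -> dict:
    """Compute a few simple, illustrative URL characteristics."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    return {
        "url_length": len(url),
        "special_char_count": sum(ch in SPECIAL_CHARS for ch in url),
        "is_https": int(parsed.scheme == "https"),
        "tld": labels[-1] if labels else "",
        # Naive count: treats the last two labels as the registered domain,
        # which is wrong for suffixes such as "co.uk".
        "subdomain_count": max(len(labels) - 2, 0),
    }


features = url_features("https://login.example.com/account?id=1")
print(features)
```

The web page content and web page feature groups would additionally require fetching and parsing the page's HTML, which this sketch does not attempt.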
Hypotheses and Findings
- URL Length: Longer URLs are more likely to be phishing.
- TLDs: Suspicious TLDs are common in phishing URLs.
- HTTPS: Both phishing and legitimate URLs use HTTPS, reducing its reliability as a single indicator.
- Obfuscation: Phishing URLs frequently use obfuscation techniques.
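One way to check hypotheses like these is to compare per-label summaries. The sketch below does this on a made-up six-row frame, so the numbers carry no meaning; the column names `URLLength`, `IsHTTPS`, and `label` are assumptions for the example.

```python
import pandas as pd

# Made-up rows purely to show the mechanics; not real dataset values.
toy = pd.DataFrame(
    {
        "URLLength": [20, 25, 30, 60, 90, 120],
        "IsHTTPS": [1, 1, 1, 1, 0, 0],
        "label": [1, 1, 1, 0, 0, 0],  # 1 = legitimate, 0 = phishing
    }
)

# Mean URL length and share of HTTPS URLs for each class.
summary = toy.groupby("label")[["URLLength", "IsHTTPS"]].mean()
print(summary)
```

A gap between the two rows of `summary` supports a feature as a useful signal, while similar values (as reported for HTTPS on the real data) suggest it is weak on its own.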
For detailed visualizations and analysis, refer to 01-EDA.ipynb.
Models Trained: Decision Tree, Random Forest, SVM, Naive Bayes.
Steps:
1. Data Preparation: Categorical encoding, feature-target definition, train-test split.
2. Model Training: Built initial models.
3. Hyperparameter Tuning: Used GridSearchCV with cross-validation to optimize parameters.
4. Evaluation: Assessed performance using accuracy, precision, recall, F1-score, ROC-AUC, and PR curves.
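The steps above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions: synthetic data from `make_classification` stands in for the encoded PhiUSIIL features, and the parameter grid, split ratio, fold count, and scoring metric are hypothetical choices, not the settings used in the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Step 1: data preparation. Synthetic numeric features replace the real,
# already-encoded dataset, so no categorical encoding is shown here.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Steps 2 and 3: train and tune with cross-validated grid search.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}  # hypothetical grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_train, y_train)

# Step 4: evaluate the refitted best estimator on the held-out split.
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_score = best_model.predict_proba(X_test)[:, 1]  # probability of class 1

print(search.best_params_)
print(classification_report(y_test, y_pred))  # precision, recall, F1, accuracy
roc_auc = roc_auc_score(y_test, y_score)
print(roc_auc)
```

The same pattern applies to the other three models by swapping the estimator and its grid. PR curves can be drawn from `y_score` with `sklearn.metrics.precision_recall_curve`.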
Best Model: Random Forest achieved the highest performance.
Saved as rf_model_phiusiil_grid_search.pkl.
Feature Importance: Visualized the Top 10 Most Important Features.
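As a rough sketch of the last two points, the snippet below ranks features by a fitted Random Forest's impurity-based importances and round-trips the model through `pickle`. The data is synthetic, the feature names are placeholders, and using `pickle` is an assumption: the `.pkl` extension does not say whether the notebook used `pickle` or `joblib`.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=200, n_features=15, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Indices of the 10 largest importances, in descending order.
order = np.argsort(model.feature_importances_)[::-1][:10]
top10 = [(feature_names[i], float(model.feature_importances_[i])) for i in order]
for name, score in top10:
    print(f"{name}: {score:.3f}")

# Serialize and restore in memory; writing to a file works the same way
# with pickle.dump / pickle.load on an open binary file handle.
restored = pickle.loads(pickle.dumps(model))
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.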
Refer to the files below for detailed analysis and documentation:
Decision Tree: 02-a-ModelTraining-DecisionTree.ipynb
SVM: 02-b-ModelTraining-SVM.ipynb
Naive Bayes: 02-c-ModelTraining-NaiveBayes.ipynb
Random Forest: 02-d-ModelTraining-RandomForest.ipynb
ComparisonAndBestModelSelection: 02-e-ModelTraining-ComparisonAndBestModelSelection.ipynb
requirements.txt contains the list of required Python libraries.
Future Work
(1) Real-Time Phishing Detection: Develop and deploy real-time systems, such as browser extensions or exchange platforms, to detect and block phishing URLs as they appear, providing immediate protection for users.
(2) Model Retraining and Updates: Continuously update and retrain the model with new phishing data to adapt to evolving tactics and improve detection accuracy.
(3) Advanced Feature Integration: Incorporate additional features such as user interaction data and behavioural analysis to improve model robustness and detect more sophisticated phishing attacks.
- Deepansh Sood
- Shambo Samanta
- Sudipta Ghosh
- Sourajit Bhar