Phishing Analysis

Project repository for DA 204o Data Science in Practice (Aug semester 2024) @ IISc BLR

Project Purpose

The purpose of this project is to develop a reliable machine learning-based system for detecting phishing URLs by analyzing their structure, content, and behavior.

Dataset

Source: PhiUSIIL Phishing URL (Website)

Summary: The PhiUSIIL Phishing URL Dataset comprises 134,850 legitimate and 100,945 phishing URLs (235,795 in total). According to the dataset authors, most of the URLs were recent at the time the dataset was constructed. Features are extracted from the URL and from the source code of the webpage. Some features, such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb, are derived from existing features.

Additional Info:

Column "FILENAME" can be ignored.
Label 1 corresponds to a legitimate URL and label 0 to a phishing URL.
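
The two notes above can be illustrated with a minimal sketch. It uses a tiny in-memory stand-in for the dataset rather than the real file, and the column names URL and URLLength are assumptions made for illustration; only FILENAME and the label convention come from the dataset notes.

```python
import io

import pandas as pd

# Tiny stand-in for the dataset CSV. Only FILENAME and label are taken from
# the notes above; URL and URLLength are assumed column names.
csv_text = """FILENAME,URL,URLLength,label
a.txt,https://www.example.com,23,1
b.txt,http://examp1e-login.top/verify,31,0
c.txt,https://www.example.org,23,1
"""

df = pd.read_csv(io.StringIO(csv_text))

# FILENAME can be ignored, so drop it before any analysis.
df = df.drop(columns=["FILENAME"])

# Label convention: 1 = legitimate, 0 = phishing.
label_names = {1: "legitimate", 0: "phishing"}
class_counts = df["label"].map(label_names).value_counts()
print(class_counts)
```

Mapping the numeric labels to names early makes it harder to misread plots and reports, since 1 denotes the legitimate class here rather than the phishing class.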

Project Steps

1. Exploratory Data Analysis (EDA)

Dataset Overview

Basic exploration: info(), describe(), shape, and isnull().
The dataset contains 235,795 rows with no missing or duplicate values.

Key Groupings of Features

URL Characteristics: Length, special characters, obfuscation metrics.
Legitimacy Indicators: HTTPS usage, TLD legitimacy, subdomains.
Web Page Content: Title, favicon, and descriptions.
Web Page Features: Redirects, popups, and social network links.

Hypotheses and Findings

URL Length: Longer URLs are more likely to be phishing.
TLDs: Suspicious TLDs are common in phishing URLs.
HTTPS: Both phishing and legitimate URLs use HTTPS, reducing its reliability as a single indicator.
Obfuscation: Phishing URLs frequently use obfuscation techniques.
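
As a rough illustration of how the URL length and HTTPS hypotheses can be checked, the sketch below compares per-class averages with a pandas groupby. The data is made up and the column names URLLength and IsHTTPS are assumptions; the actual findings come from the full dataset, not from this toy example.

```python
import pandas as pd

# Made-up rows for illustration only; label 1 = legitimate, 0 = phishing.
df = pd.DataFrame(
    {
        "URLLength": [22, 25, 24, 61, 78, 30],
        "IsHTTPS": [1, 1, 1, 1, 0, 1],
        "label": [1, 1, 1, 0, 0, 0],
    }
)

# Mean URL length and share of HTTPS URLs within each class.
summary = df.groupby("label").agg(
    mean_url_length=("URLLength", "mean"),
    https_share=("IsHTTPS", "mean"),
)
print(summary)
```

A large gap in mean_url_length between the classes would support the URL length hypothesis, while an https_share well above zero for label 0 would show why HTTPS alone is a weak indicator.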

For detailed visualizations and analysis, refer to 01-EDA.ipynb.

2. Model Training and Evaluation

Models Trained: Decision Tree, Random Forest, SVM, Naive Bayes.

Steps:

1. Data Preparation: Categorical encoding, feature-target definition, train-test split.

2. Model Training: Built initial models.

3. Hyperparameter Tuning: Used GridSearchCV with cross-validation to optimize parameters.

4. Evaluation: Assessed performance using accuracy, precision, recall, F1-score, ROC-AUC, and PR curves.
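
The four steps above can be sketched end to end with scikit-learn. This is an illustrative outline on synthetic data, not the notebooks' actual code: the parameter grid, the scoring metric, and the split ratio are assumptions, and categorical encoding is skipped because the synthetic features are already numeric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Data preparation: synthetic features and target, stratified split.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2-3. Model training and hyperparameter tuning with cross-validation.
# The grid below is a small assumed example, not the project's grid.
param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="f1"
)
search.fit(X_train, y_train)

# 4. Evaluation on the held-out test set.
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_score = best_model.predict_proba(X_test)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
print(search.best_params_)
print(metrics)
```

Tuning on the training split only and reporting metrics on the untouched test split keeps the reported scores from being inflated by the search itself.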

Best Model: Random Forest achieved the highest performance.

Saved as rf_model_phiusiil_grid_search.pkl.
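
A minimal sketch of saving and restoring a model is shown below. It assumes the model was written with Python's standard pickle module, which the repository does not confirm; if joblib was used instead, joblib.load would be the matching call. The sketch trains a small throwaway model and writes it to a temporary file named demo_model.pkl (a hypothetical name) instead of touching the real file.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Throwaway model on synthetic data, standing in for the tuned Random Forest.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name

    # Save the fitted model to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a consumer of the saved model would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled models should only be loaded from trusted sources and with a scikit-learn version compatible with the one used for training.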

Feature Importance: Visualized the Top 10 Most Important Features.
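
One common way to obtain such a ranking is the feature_importances_ attribute of a fitted Random Forest. The sketch below assumes that approach and uses synthetic data with placeholder feature names; the project's real top 10 features are documented in the notebooks.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (feature_0 ... feature_14).
X, y = make_classification(
    n_samples=200, n_features=15, n_informative=5, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one value per feature.
importances = pd.Series(model.feature_importances_, index=feature_names)

# Keep the ten highest-ranked features, largest first.
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
```

With matplotlib installed, top10.plot.barh() would render the ranking as a horizontal bar chart.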

Refer to the following notebooks for detailed analysis and documentation:

Decision Tree: 02-a-ModelTraining-DecisionTree.ipynb

SVM: 02-b-ModelTraining-SVM.ipynb

Naive Bayes: 02-c-ModelTraining-NaiveBayes.ipynb

Random Forest: 02-d-ModelTraining-RandomForest.ipynb

ComparisonAndBestModelSelection: 02-e-ModelTraining-ComparisonAndBestModelSelection.ipynb

requirements.txt lists the required Python libraries.

3. Model Deployment & Demo

Future Work

(1) Real-Time Phishing Detection: Develop and deploy real-time systems, such as browser extensions or exchange platforms, to detect and block phishing URLs as they appear, providing immediate protection for users.

(2) Model Retraining and Updates: Continuously update and retrain the model with new phishing data to adapt to evolving tactics and improve detection accuracy.

(3) Advanced Feature Integration: Incorporate additional features such as user interaction data and behavioural analysis to improve model robustness and detect more sophisticated phishing attacks.

Team Members

Deepansh Sood
Shambo Samanta
Sudipta Ghosh
Sourajit Bhar

CySentinels/DA-204o
