Project repository for DA 231o Data Engineering at Scale (August 2024 Term) @ IISc BLR
The goal of this project is to build a reliable machine-learning system for detecting phishing URLs by analyzing their structure, content, and behavior. The system is designed to run on a distributed cluster so it can keep pace with a continuously growing data load, as new legitimate websites are deployed every second and threat actors evolve even faster with new techniques.
Source: PhiUSIIL Phishing URL (Website)
Summary: The PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs analyzed while constructing the dataset are recent. Features are extracted from the source code of the webpage and from the URL itself. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.
Additional Info:
- Column "FILENAME" can be ignored.
- Label 1 corresponds to a legitimate URL; label 0 to a phishing URL.
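As a minimal sketch of how these conventions might be applied when loading the CSV (the URL column name and the sample rows are illustrative assumptions, not taken from the dataset):

```python
import csv
import io

# Tiny in-memory sample standing in for the dataset CSV; the URL column
# name and both rows are hypothetical.
sample = io.StringIO(
    "FILENAME,URL,label\n"
    "a.txt,https://example.com,1\n"
    "b.txt,http://paypa1-login.example,0\n"
)

LABELS = {1: "legitimate", 0: "phishing"}  # label semantics from the notes above

rows = []
for row in csv.DictReader(sample):
    row.pop("FILENAME", None)   # the FILENAME column can be ignored
    row["label"] = int(row["label"])
    rows.append(row)

for r in rows:
    print(r["URL"], "->", LABELS[r["label"]])
```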
This guide helps you set up Apache Spark in standalone mode on a Windows system and connect it with Hadoop.
- Java: Download and install the JDK and set `JAVA_HOME`.
- Hadoop: Download the Hadoop binaries and configure them.
- Apache Spark: Download the pre-built version for Hadoop from Apache Spark.
- Set the `JAVA_HOME` environment variable and add it to `PATH`:

      set JAVA_HOME=C:\Program Files\Java\jdk-xx.x.x
      set PATH=%JAVA_HOME%\bin;%PATH%
- Extract Hadoop: Extract the Hadoop binaries to a directory (e.g., C:\hadoop).
- Configure Hadoop: In the `etc/hadoop` folder, configure the following files:
- core-site.xml:
      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://localhost:9000</value>
        </property>
      </configuration>
- hdfs-site.xml:
      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>C:\hadoop\data\namenode</value>
        </property>
        <property>
          <name>dfs.datanode.data.dir</name>
          <value>C:\hadoop\data\datanode</value>
        </property>
      </configuration>
- mapred-site.xml:
      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
      </configuration>
- yarn-site.xml:
      <configuration>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
      </configuration>
- Format the Hadoop file system:

      hdfs namenode -format
- Start Hadoop services:

      start-dfs.cmd
      start-yarn.cmd
- Extract Spark: Extract the downloaded pre-built Spark package to a directory (e.g., C:\spark).
- Set the `SPARK_HOME` environment variable and add it to `PATH`:

      set SPARK_HOME=C:\spark
      set PATH=%SPARK_HOME%\bin;%PATH%
- Verify Spark Installation: Open a command prompt and type:

      spark-shell
- Edit spark-env.cmd:
- Navigate to C:\spark\conf and rename spark-env.cmd.template to spark-env.cmd.
- Add the following lines:

      set HADOOP_HOME=E:\IISC\hadoop
      set SPARK_DIST_CLASSPATH=%HADOOP_HOME%\bin;%HADOOP_HOME%\lib;%HADOOP_HOME%\etc\hadoop
- Add Hadoop Binary Path: Ensure %HADOOP_HOME%\bin is in your system PATH for Spark to recognize Hadoop executables.
- Launch spark-shell and try accessing HDFS:
      val rdd = sc.textFile("hdfs://localhost:9000/path/to/file")
      rdd.collect().foreach(println)
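Before launching spark-shell against HDFS, it can help to confirm that the environment variables from the steps above are actually set. A minimal sketch in plain Python (the variable names come from this guide; the helper function and sample mapping are ours):

```python
import os

REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")

def missing_env(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Hypothetical environment mapping for illustration:
env = {"JAVA_HOME": r"C:\Program Files\Java\jdk-xx.x.x", "SPARK_HOME": r"C:\spark"}
print(missing_env(env))  # -> ['HADOOP_HOME']
```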
Dataset Overview
- Basic exploration: printSchema(), describe(), and dropDuplicates().
- The dataset contains 235,795 rows with no missing or duplicate values.
Key Groupings of Features
- URL Characteristics: Length, special characters, obfuscation metrics.
- Legitimacy Indicators: HTTPS usage, TLD legitimacy, subdomains.
- Web Page Content: Title, favicon, and descriptions.
- Web Page Features: Redirects, popups, and social network links.
Hypotheses and Findings
- URL Length: Longer URLs are more likely to be phishing.
- TLDs: Suspicious TLDs are common in phishing URLs.
- HTTPS: Both phishing and legitimate URLs use HTTPS, reducing its reliability as a single indicator.
- Obfuscation: Phishing URLs frequently use obfuscation techniques.
For detailed visualizations and analysis, refer to 01-EDA.ipynb.
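To make the feature groupings and hypotheses above concrete, here is a small sketch of the kind of URL-level features involved, in plain Python (the feature names, special-character set, and suspicious-TLD list are our own illustrative choices, not the dataset's definitions):

```python
from urllib.parse import urlparse

SUSPICIOUS_TLDS = {"xyz", "top", "tk", "zip"}  # illustrative list, not from the dataset
SPECIAL_CHARS = set("@-_%?=&")

def url_features(url):
    """Extract a few simple features mirroring the groupings above."""
    parsed = urlparse(url)
    host = parsed.netloc
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return {
        "length": len(url),
        "special_chars": sum(c in SPECIAL_CHARS for c in url),
        "is_https": parsed.scheme == "https",
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
    }

print(url_features("https://example.com/login"))
```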
Models Trained: Decision Tree, Random Forest, SVM, Naive Bayes.
Steps:
1. Data Preparation: Categorical encoding, feature-target definition, train-test split.
2. Model Training: Built initial models.
3. Hyperparameter Tuning: Used GridSearchCV with cross-validation to optimize parameters.
4. Evaluation: Assessed performance using accuracy, precision, recall, F1-score, ROC-AUC, and PR curves.
Random Forest achieved the highest performance and was saved as best_random_forest_model.model.zip.
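As an illustration of the evaluation step, the threshold-based metrics listed above can be computed directly from confusion-matrix counts; this is a generic sketch in plain Python (the counts shown are hypothetical, not the project's actual results):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts for illustration only:
print(metrics(tp=95, fp=5, fn=10, tn=90))
```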
- 01_EDA.ipynb: Jupyter Notebook for Exploratory Data Analysis on the collected data, combining code, markdown, and outputs in a single document.
- 02-ModelTraining.ipynb: Jupyter Notebook for model training and saving the best model to the local system, combining code, markdown, and outputs in a single document.
- Scripts to run on the single-node Spark cluster with HDFS (steps to set this up in a Windows environment are given above):
- eda.py
- model_training.py
- requirements.txt: lists the required Python libraries.
- Shambo Samanta
- Deepansh Sood
- Sudipta Ghosh
- Sourajit Bhar