Project repository for DA 231o Data Engineering at Scale (August 2024 Term) @ IISc BLR
The goal of this project is to build a reliable machine-learning system for detecting phishing URLs by analyzing their structure, content, and behavior. The system is designed to run on a distributed cluster so it can keep pace with a continuously growing data load, as new legitimate websites are deployed every second and threat actors evolve even faster with new techniques.
Source: PhiUSIIL Phishing URL (Website)
Summary: The PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs analyzed while constructing the dataset are recent. Features are extracted from the source code of the webpage and from the URL itself. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.
Additional Info:
- Column "FILENAME" can be ignored.
- Label 1 corresponds to a legitimate URL; label 0 to a phishing URL.
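As a minimal sketch of how these conventions might be applied when loading the CSV (the URL column name and the sample rows are illustrative assumptions, not taken from the dataset):

```python
import csv
import io

# Tiny in-memory sample standing in for the dataset CSV; the URL column
# name and both rows are hypothetical.
sample = io.StringIO(
    "FILENAME,URL,label\n"
    "a.txt,https://example.com,1\n"
    "b.txt,http://paypa1-login.example,0\n"
)

LABELS = {1: "legitimate", 0: "phishing"}  # label semantics from the notes above

rows = []
for row in csv.DictReader(sample):
    row.pop("FILENAME", None)   # the FILENAME column can be ignored
    row["label"] = int(row["label"])
    rows.append(row)

for r in rows:
    print(r["URL"], "->", LABELS[r["label"]])
```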
This guide helps you set up Apache Spark in standalone mode on a Windows system and connect it with Hadoop.
- Java: Download and install the JDK and set `JAVA_HOME`.
- Hadoop: Download the Hadoop binaries and configure them.
- Apache Spark: Download the pre-built version for Hadoop from Apache Spark.
- Set the `JAVA_HOME` environment variable and add it to `PATH`:

      set JAVA_HOME=C:\Program Files\Java\jdk-xx.x.x
      set PATH=%JAVA_HOME%\bin;%PATH%
- Extract Hadoop: Extract the Hadoop binaries to a directory (e.g., C:\hadoop).
- Configure Hadoop: In the `etc/hadoop` folder, configure the following files:
- core-site.xml:
      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://localhost:9000</value>
        </property>
      </configuration>
- hdfs-site.xml:
      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>C:\hadoop\data\namenode</value>
        </property>
        <property>
          <name>dfs.datanode.data.dir</name>
          <value>C:\hadoop\data\datanode</value>
        </property>
      </configuration>
- mapred-site.xml:
      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
      </configuration>
- yarn-site.xml:
      <configuration>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
      </configuration>
- Format the Hadoop file system:

      hdfs namenode -format
- Start Hadoop services:

      start-dfs.cmd
      start-yarn.cmd
- Extract Spark: Extract the downloaded pre-built Spark package to a directory (e.g., C:\spark).
- Set the `SPARK_HOME` environment variable and add it to `PATH`:

      set SPARK_HOME=C:\spark
      set PATH=%SPARK_HOME%\bin;%PATH%
- Verify Spark Installation: Open a command prompt and type:

      spark-shell
- Edit spark-env.cmd:
- Navigate to C:\spark\conf and rename spark-env.cmd.template to spark-env.cmd.
- Add the following lines:

      set HADOOP_HOME=E:\IISC\hadoop
      set SPARK_DIST_CLASSPATH=%HADOOP_HOME%\bin;%HADOOP_HOME%\lib;%HADOOP_HOME%\etc\hadoop
- Add Hadoop Binary Path: Ensure %HADOOP_HOME%\bin is in your system PATH for Spark to recognize Hadoop executables.
- Launch spark-shell and try accessing HDFS:
      val rdd = sc.textFile("hdfs://localhost:9000/path/to/file")
      rdd.collect().foreach(println)
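Before launching spark-shell against HDFS, it can help to confirm that the environment variables from the steps above are actually set. A minimal sketch in plain Python (the variable names come from this guide; the helper function and sample mapping are ours):

```python
import os

REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")

def missing_env(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Hypothetical environment mapping for illustration:
env = {"JAVA_HOME": r"C:\Program Files\Java\jdk-xx.x.x", "SPARK_HOME": r"C:\spark"}
print(missing_env(env))  # -> ['HADOOP_HOME']
```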
Dataset Overview
- Basic exploration: printSchema(), describe(), and dropDuplicates().
- The dataset contains 235,795 rows with no missing or duplicate values.
Key Groupings of Features
- URL Characteristics: Length, special characters, obfuscation metrics.
- Legitimacy Indicators: HTTPS usage, TLD legitimacy, subdomains.
- Web Page Content: Title, favicon, and descriptions.
- Web Page Features: Redirects, popups, and social network links.
Hypotheses and Findings
- URL Length: Longer URLs are more likely to be phishing.
- TLDs: Suspicious TLDs are common in phishing URLs.
- HTTPS: Both phishing and legitimate URLs use HTTPS, reducing its reliability as a single indicator.
- Obfuscation: Phishing URLs frequently use obfuscation techniques.
For detailed visualizations and analysis, refer to 01-EDA.ipynb.
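To make the feature groupings and hypotheses above concrete, here is a small sketch of the kind of URL-level features involved, in plain Python (the feature names, special-character set, and suspicious-TLD list are our own illustrative choices, not the dataset's definitions):

```python
from urllib.parse import urlparse

SUSPICIOUS_TLDS = {"xyz", "top", "tk", "zip"}  # illustrative list, not from the dataset
SPECIAL_CHARS = set("@-_%?=&")

def url_features(url):
    """Extract a few simple features mirroring the groupings above."""
    parsed = urlparse(url)
    host = parsed.netloc
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return {
        "length": len(url),
        "special_chars": sum(c in SPECIAL_CHARS for c in url),
        "is_https": parsed.scheme == "https",
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
    }

print(url_features("https://example.com/login"))
```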
Models Trained: Decision Tree, Random Forest, SVM, Naive Bayes.
Steps:
1. Data Preparation: Categorical encoding, feature-target definition, train-test split.
2. Model Training: Built initial models.
3. Hyperparameter Tuning: Used GridSearchCV with cross-validation to optimize parameters.
4. Evaluation: Assessed performance using accuracy, precision, recall, F1-score, ROC-AUC, and PR curves.
Random Forest achieved the highest performance and was saved as best_random_forest_model.model.zip.
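As an illustration of the evaluation step, the threshold-based metrics listed above can be computed directly from confusion-matrix counts; this is a generic sketch in plain Python (the counts shown are hypothetical, not the project's actual results):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts for illustration only:
print(metrics(tp=95, fp=5, fn=10, tn=90))
```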
- 01_EDA.ipynb: Jupyter Notebook for Exploratory Data Analysis on the collected data, combining code, markdown, and outputs in a single document.
- 02-ModelTraining.ipynb: Jupyter Notebook for model training and saving the best model to the local system, combining code, markdown, and outputs in a single document.
- Scripts to run on the single-node Spark cluster with HDFS (steps to set this up in a Windows environment are given above):
- eda.py
- model_training.py
- requirements.txt: lists the required Python libraries.
- Shambo Samanta
- Deepansh Sood
- Sudipta Ghosh
- Sourajit Bhar