This repository is a tutorial explaining how to train a simple decision tree classifier to detect websites that are used for phishing. Phishing websites typically disguise themselves as trustworthy websites to gain their victims' trust, and malicious parties use them to obtain sensitive information such as passwords or credit card numbers. In this tutorial, we train a decision tree that detects such websites with an accuracy of about 90.5% on test data.
To get started, clone this repository by running the following command from a UNIX terminal:
git clone https://github.com/shubham-pawar/phishing-detection
This will download the code that trains the phishing detector, as well as the training data required for that operation.
You should also install scikit-learn, a collection of machine learning tools written in Python. Installation instructions are available in the scikit-learn documentation. On a UNIX machine configured with pip, the simplest way is to run:
pip install -U scikit-learn
Once you have installed scikit-learn, you can check whether the library is correctly set up by typing the following in a Python shell:
import sklearn
If the command runs with no error, you are ready to train the phishing detector!
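If you would rather check the installation from a script than from an interactive shell, the following is a minimal sketch. The helper name `is_installed` is hypothetical and not part of this repository; it relies only on the standard-library `importlib.util.find_spec`, so it reports a missing package without raising an error.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located for import (without importing it)."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("sklearn"):
        import sklearn

        # __version__ is the installed scikit-learn release string.
        print("scikit-learn version:", sklearn.__version__)
    else:
        print("scikit-learn is not installed; run: pip install -U scikit-learn")
```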
In this tutorial, we use a dataset of phishing websites that is publicly available in the UCI machine learning repository. You don't have to download the dataset yourself: it is included in this repository (the dataset.csv file) and was downloaded to your machine when you cloned the repository.
The dataset was collected by analyzing a collection of 2456 websites, some of which were used for phishing and others not. For each website included in the dataset, 30 attributes are given.
Each website in the dataset is labeled -1 if it is not a phishing website and 1 if it is used for phishing.
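To make the layout described above concrete, here is a small sketch of how rows of this kind could be separated into attributes and labels. It rests on assumptions that are not confirmed by this README: that the file is comma-separated integers with no header row and that the label is the last column. The function name `split_features_labels` is hypothetical; decision_tree.py may load the data differently.

```python
import csv
import io


def split_features_labels(csv_text, n_features=30):
    """Parse CSV text into (features, labels).

    Assumes (not confirmed by the README) integer values, no header row,
    and the label (-1 or 1) stored in the last column.
    """
    features, labels = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        values = [int(v) for v in row]
        if len(values) != n_features + 1:
            raise ValueError(
                f"expected {n_features + 1} columns, got {len(values)}"
            )
        features.append(values[:-1])  # the attributes
        labels.append(values[-1])     # -1 = not phishing, 1 = phishing
    return features, labels


# Toy input with 3 attributes per row instead of 30, for readability.
X, y = split_features_labels("1,-1,0,1\n-1,1,1,-1\n", n_features=3)
print(X, y)
```

For the real file, the text could be read with `open("dataset.csv").read()` and passed in with the default `n_features=30`, provided the assumptions above hold.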
To train and test the model, run:
python decision_tree.py
This will first train the decision tree on 2,000 websites, then use the trained model to predict whether 456 other websites are used for phishing (these websites are not seen during training). The model's predictions should be about 90.5% correct, i.e., its accuracy on the testing data should be about 90.5%. Here is the output printed by the script:
Tutorial: Training a decision tree to detect phishing websites
Training data loaded.
Decision tree classifier created.
Beginning model training.
Model training completed.
Predictions on testing data computed.
The accuracy of your decision tree on testing data is: 0.906129210381
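The train-then-evaluate flow reported in this output can be sketched as follows. This is an illustration under stated assumptions, not the contents of decision_tree.py: the data here is synthetic (random attributes in {-1, 0, 1} with a toy labelling rule) so that the snippet runs without dataset.csv, and only the 2,000/456 split sizes are taken from the text above.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for dataset.csv (assumption): 2456 rows, 30 attributes.
X = rng.integers(-1, 2, size=(2456, 30))
# Toy labelling rule so the tree has a pattern to learn: 1 = phishing, -1 = not.
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# First 2,000 rows for training, remaining 456 held out for testing.
X_train, y_train = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]

clf = DecisionTreeClassifier()      # default parameters
clf.fit(X_train, y_train)           # train on the 2,000 examples
predictions = clf.predict(X_test)   # predict the 456 unseen examples

# Accuracy is the fraction of test predictions that match the true labels.
accuracy = accuracy_score(y_test, predictions)
print("The accuracy on testing data is:", accuracy)
```

The accuracy printed by this sketch reflects the toy labelling rule, so it is not comparable with the figure obtained on the real dataset.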
To understand how this was done, you can read the line-by-line comments in the decision_tree.py file.
You can try improving the accuracy of this simple classifier by changing some of the model's default parameter values. To do so, modify the decision_tree.py file. To learn more about the parameters that you can set when calling DecisionTreeClassifier(), take a look at the scikit-learn documentation.
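As a starting point for such experiments, the sketch below compares a few parameter settings. `max_depth`, `min_samples_leaf`, and `criterion` are real `DecisionTreeClassifier` parameters, but the specific values are arbitrary choices for illustration, and the data is again synthetic (an assumption) so that the snippet runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 500 rows, 30 attributes in {-1, 0, 1}.
X = rng.integers(-1, 2, size=(500, 30))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

# Parameter settings to compare; the values are illustrative only.
candidates = [
    {},                                               # library defaults
    {"max_depth": 5},                                 # limit tree depth
    {"criterion": "entropy", "min_samples_leaf": 5},  # other split measure, larger leaves
]

results = []
for params in candidates:
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data.
    results.append((params, clf.score(X_test, y_test)))

for params, acc in results:
    print(params, acc)
```

When tuning on the real dataset, choosing parameters on a separate validation split rather than on the test set gives a less optimistic estimate of the final accuracy.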
Credits
The credits for this code go to npapernot. I've merely created a wrapper to get people started.