This week we will look at some data wrangling on a tabular dataset. We will then fit a decision tree and a random forest model to the data.
This class focuses on the tools and concepts you are likely to encounter when working with cyberinfrastructure, so we will not cover the mathematics behind the machine learning algorithms in much depth. I encourage you to look at the referenced materials if you find the techniques interesting. This tutorial and the next are based on the Practical Deep Learning for Coders lessons by Jeremy Howard. References are included in the Resources section at the end of this file.
This section is modeled after the excellent tutorial by Jeremy Howard titled "How Random Forests Really Work". I recommend reading it for more detail on how decision trees and random forests work.
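To give a feel for the idea before opening the notebook, here is a minimal sketch in plain Python. It is not the code from the notebook or from Jeremy Howard's tutorial. It assumes a tiny made-up table (the rows in `ROWS` are hypothetical, not real Titanic records), uses Gini impurity to score splits, and builds one-split trees ("stumps"). The "forest" here is just a majority vote over stumps fit on bootstrap samples; real random forests grow deeper trees and also subsample the columns considered at each split. All function names are illustrative.

```python
import random

# Hypothetical toy rows, NOT real Titanic records: (is_female, fare, survived)
ROWS = [
    (1, 80.0, 1), (1, 15.0, 1), (1, 8.0, 1), (1, 7.5, 1),
    (0, 90.0, 1), (0, 30.0, 0), (0, 8.0, 0), (0, 7.0, 0),
]

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_score(rows, col, threshold):
    """Size-weighted impurity of the two groups made by row[col] <= threshold."""
    left = [r[-1] for r in rows if r[col] <= threshold]
    right = [r[-1] for r in rows if r[col] > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(rows)

def best_split(rows, cols):
    """Try every observed value of every column; return (score, col, threshold)."""
    return min((split_score(rows, c, r[c]), c, r[c]) for c in cols for r in rows)

def majority(labels, default):
    """Most common 0/1 label (ties go to 1); `default` if the group is empty."""
    if not labels:
        return default
    return int(sum(labels) * 2 >= len(labels))

def fit_stump(rows, cols):
    """A one-split decision tree: (col, threshold, left_label, right_label)."""
    _, col, thr = best_split(rows, cols)
    overall = majority([r[-1] for r in rows], 0)
    left = majority([r[-1] for r in rows if r[col] <= thr], overall)
    right = majority([r[-1] for r in rows if r[col] > thr], overall)
    return col, thr, left, right

def predict_stump(stump, row):
    col, thr, left, right = stump
    return left if row[col] <= thr else right

def fit_forest(rows, cols, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows], cols)
            for _ in range(n_trees)]

def predict_forest(forest, row):
    """Majority vote over the individual stumps."""
    return majority([predict_stump(s, row) for s in forest], 0)

print(best_split(ROWS, [0, 1]))  # (0.1875, 0, 0): split on column 0 (is_female)
forest = fit_forest(ROWS, [0, 1])
print(predict_forest(forest, (1, 20.0)), predict_forest(forest, (0, 20.0)))
```

On this toy table the single best split is on the first column, because it leaves one group perfectly pure. The notebook does the same kind of thing on the real data with library implementations rather than hand-written loops.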
Open your VM and run git pull in the cicf folder. Then install the packages used in this section:
sudo apt install graphviz
pip install --upgrade jupyter-core nbconvert seaborn fastai
We are going to work with the Titanic dataset, which is a passenger manifest from the Titanic. Let's first look at the dataset.
The rest of this section is in the notebook random-forest.ipynb.
Sources for the tutorial notebook:
- Random Forests - Practical Deep Learning for Coders
- "How Random Forests Really Work"
- Neural Net Foundations
Other Interesting links:
- Google Colab provides a Jupyter notebook-like interface on top of a cloud computing platform. Definitely worth looking at.
- Titanic - Machine Learning from Disaster is the Kaggle competition I mentioned.
- Astronomers Dig Up the Stars That Birthed the Milky Way
- Cornell Machine Learning Course
- Google Machine Learning Course