Ubuntu Airflow Setup 🔝
Clone the repository in your home directory `/home/<user>/`. We'll refer to this location during the workshop.
$ git clone https://github.com/deliveryhero/pyconde2019-airflow-ml-workshop
To proceed with the workshop you need Python 3.7+ installed on your machine. :warning: Note: the procedure varies according to your distribution and its version.
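As a quick sanity check, you can print the Python 3 version your system currently provides (on Ubuntu 18.04 this is typically 3.6, which is why we add a newer Python below):
# Print the system's default Python 3 version (the workshop needs 3.7+)
$ python3 --version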
Be logged in as a user with sudo access so that you can install packages.
Install Pip, Python 3.7+ and venv
$ sudo apt update
$ sudo apt -y upgrade
# Install pip if not already present on your machine
$ sudo apt install python3-pip
# Ubuntu 18.04 ships only Python 3.6. Add the deadsnakes PPA to get Python 3.7
$ sudo add-apt-repository ppa:deadsnakes/ppa
$ sudo apt update
# Install Python 3.7, its dev headers and venv for creating virtual environments
$ sudo apt install python3.7 python3.7-dev python3.7-venv
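A quick check that the new interpreter is now available alongside the system Python:
# Should print Python 3.7.x
$ python3.7 --version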
# Create a virtual environment named airflow_env in your home
$ python3.7 -m venv /home/<user>/airflow_env
# Activate it
$ source /home/<user>/airflow_env/bin/activate
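To make sure the activation worked, check that `python` now resolves to the interpreter inside `airflow_env`:
# Both should point into /home/<user>/airflow_env
$ which python
$ python --version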
Move to the repository directory (it should be in your home directory `/home/<user>/`) to install Airflow:
# Go to the directory where you downloaded the repo
$ cd /home/<user>/pyconde2019-airflow-ml-workshop
# export the PYTHONPATH
$ export PYTHONPATH=$PYTHONPATH:/home/<user>/pyconde2019-airflow-ml-workshop/
# Avoid the RuntimeError "By default one of Airflow's dependencies installs a GPL dependency (unidecode)..."
$ export SLUGIFY_USES_TEXT_UNIDECODE=yes
# upgrade pip
$ pip3.7 install pip --upgrade
# install requirements in the environment
$ pip3.7 install -r requirements.txt
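Before moving on you can verify that Airflow was installed into the virtualenv; the exact version printed depends on what `requirements.txt` pins:
# Show the installed Airflow package and its version
$ pip3.7 show apache-airflow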
Before launching Airflow, initialise the SQLite Airflow database. The Airflow database keeps information about DAGs, tasks, connections, users, etc. SQLite is the default option; in production you will probably use another RDBMS such as MySQL or PostgreSQL. Note: SQLite doesn't allow tasks to run in parallel.
Initialise the database:
$ airflow initdb
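As an aside (not needed for this workshop): the database Airflow talks to is configured through the `sql_alchemy_conn` setting in `airflow.cfg`, so switching to PostgreSQL in production is a matter of changing that connection string. The PostgreSQL value below is only a hypothetical illustration:
# Inspect the current (SQLite) connection string
$ grep "^sql_alchemy_conn" ~/airflow/airflow.cfg
# A hypothetical PostgreSQL value would look like:
# sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow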
Airflow creates the directory `~/airflow/` and stores inside it:
- 🔍 the configuration file `airflow.cfg`
- the SQLite DB `airflow.db`
- the `logs` folder
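You can confirm this with a quick listing (minor extra files may appear depending on the Airflow version):
# You should see airflow.cfg, airflow.db and the logs folder
$ ls ~/airflow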
Export the AIRFLOW_HOME environment variable:
$ export AIRFLOW_HOME=/home/<user>/airflow
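This export only lives in the current shell. If you don't want to re-type it in every new terminal, you can optionally append it to your shell profile (same path as above):
# Optional: persist AIRFLOW_HOME for future terminals
$ echo 'export AIRFLOW_HOME=/home/<user>/airflow' >> ~/.bashrc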
The cloned repository contains a subfolder named `dags` with the DAGs (the workflow Python files) that we'll use during this workshop.
✏️ Modify `/home/<user>/airflow/airflow.cfg`: find the `dags_folder` parameter and point it at the `dags` folder of the cloned repository.
:white_check_mark: Instead of `dags_folder = /home/<user>/airflow/dags`, put `dags_folder = /home/<user>/pyconde2019-airflow-ml-workshop/dags`
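If you prefer editing from the terminal, a `sed` one-liner can rewrite the parameter in place (it assumes the default `dags_folder = ...` line shown above; verify with the `grep` afterwards):
# Point dags_folder at the workshop repository
$ sed -i 's#^dags_folder = .*#dags_folder = /home/<user>/pyconde2019-airflow-ml-workshop/dags#' /home/<user>/airflow/airflow.cfg
# Check the new value
$ grep "^dags_folder" /home/<user>/airflow/airflow.cfg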
Finally, everything is ready to run the Airflow webserver!
From the active `airflow_env` virtualenv, execute:
$ airflow webserver --port 8080
and then open the browser to localhost:8080.
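Optionally, from another terminal you can check that the webserver is responding before opening the browser (any HTTP response means it is up):
# Request only the response headers from the webserver
$ curl -I http://localhost:8080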
Check out the Airflow UI:
The Airflow Webserver is running in its virtual environment; we need to activate the same virtual environment for the Scheduler as well.
To start the Airflow Scheduler, activate the environment in a second terminal (in the `~/pyconde2019-airflow-ml-workshop` directory) and launch the scheduler.
Note: it's fundamental that both the Scheduler and the Webserver have the same PYTHONPATH and AIRFLOW_HOME:
$ source /home/<user>/airflow_env/bin/activate
$ cd /home/<user>/pyconde2019-airflow-ml-workshop
$ export PYTHONPATH=$PYTHONPATH:/home/<user>/pyconde2019-airflow-ml-workshop/
$ export AIRFLOW_HOME=/home/<user>/airflow
$ airflow scheduler
You will notice this message in the scheduler logs: `ERROR - Cannot use more than 1 thread when using sqlite. Setting parallelism to 1`.
This is because we are using Airflow with the SQLite DB and the SequentialExecutor.
Executors are the mechanism by which task instances get run. The SequentialExecutor runs only one task instance at a time (this is not a production setup). Consider also that SQLite doesn't support multiple connections.
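You can see which executor this setup uses by looking at the same `airflow.cfg`; with the defaults assumed here it should report the `SequentialExecutor`:
# Show the executor configured for this Airflow instance
$ grep "^executor" ~/airflow/airflow.cfg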
🏆 Great! Now everything is ready to start the Exercises!
✅ Jump to the Airflow main concepts section to continue the tutorial.