This is a simple ETL (Extract, Transform, Load) pipeline using Airflow. Steps:
- Extract the dataset from the API
- Store the dataset in Google Cloud Storage
- Retrieve the dataset from Google Cloud Storage and clean it
- Store the cleaned dataset in staging (Google Cloud Storage)
- Store the final dataset in Google BigQuery
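A minimal sketch of how these steps could be wired together as an Airflow DAG. This is an illustration only; the API URL, bucket, table, connection id, and task names below are placeholders, not necessarily the ones used in this repository:

# dags/simple_etl.py - hypothetical sketch of the pipeline described above
from datetime import datetime
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

def extract_from_api():
    # Extract the raw dataset from the (placeholder) API and save it locally
    response = requests.get("https://example.com/api/dataset", timeout=60)
    response.raise_for_status()
    with open("/tmp/raw.csv", "wb") as f:
        f.write(response.content)

def upload_raw_to_gcs():
    # Store the raw dataset in Google Cloud Storage (bucket name is a placeholder)
    GCSHook(gcp_conn_id="google_cloud_default").upload(
        bucket_name="my-etl-bucket", object_name="raw/dataset.csv", filename="/tmp/raw.csv"
    )

def clean_and_stage():
    # Retrieve the raw file, clean it, and write it back to a staging prefix
    hook = GCSHook(gcp_conn_id="google_cloud_default")
    hook.download(bucket_name="my-etl-bucket", object_name="raw/dataset.csv", filename="/tmp/raw.csv")
    # ... cleaning logic goes here ...
    hook.upload(bucket_name="my-etl-bucket", object_name="staging/dataset_clean.csv", filename="/tmp/raw.csv")

with DAG("simple_etl", start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_api)
    to_gcs = PythonOperator(task_id="upload_raw", python_callable=upload_raw_to_gcs)
    clean = PythonOperator(task_id="clean_and_stage", python_callable=clean_and_stage)
    load = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-etl-bucket",
        source_objects=["staging/dataset_clean.csv"],
        destination_project_dataset_table="my_project.my_dataset.my_table",
        write_disposition="WRITE_TRUNCATE",
        gcp_conn_id="google_cloud_default",
    )
    extract >> to_gcs >> clean >> load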
Install Docker Desktop from the official Docker website:
https://www.docker.com/products/docker-desktop/
Generate a Service Account key under IAM & Admin in the Google Cloud Console and store it as a connection in the Airflow Admin menu. The key file looks like this:
{
"type": "service_account",
"project_id": "Your Project ID",
"private_key_id": "Your Private ID",
"private_key": "Your Private Key",
"client_email": "Your Client Key",
"client_id": "Your Client ID",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/xxx.iam.gserviceaccount.com",
"universe_domain": "googleapis.com"
}
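Before adding the key to Airflow, you can optionally sanity-check that it parses and can mint credentials. A small sketch, assuming the key was saved as gcp.json:

# check_key.py - hypothetical sanity check for the service account key file
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "gcp.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
print(creds.service_account_email)  # should print the client_email from the key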
Step by step to run Airflow locally
- Clone this repository
- Install & Open Docker
- Install & Open Visual Studio Code
- Open Docker & Visual Studio Code
- Create a new file named .env and add the following lines:
AIRFLOW_IMAGE_NAME=apache/airflow:2.4.2
AIRFLOW_UID=50000
- Build and start the containers with the following command:
docker-compose up -d
- Create an Admin user with the command below:
docker-compose run airflow-worker airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin
- Log in to Airflow at the address below with the username and password created above:
http://localhost:8080/
Setting Up Connection
You can create a connection to Google Cloud from the Airflow webserver Admin menu. In this menu you can set the Keyfile Path to the Service Account key file, for example '/usr/local/airflow/dags/gcp.json'. Beforehand you need to mount your key file as a volume in your Docker container at that path. Alternatively, you can paste the key JSON content directly into the Keyfile JSON field of the connection.
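Once the connection exists, a quick way to confirm it works is to call a Google Cloud hook with that connection id. A minimal sketch, assuming the connection id is google_cloud_default and using a placeholder bucket name (run it inside the webserver or worker container, or as a small test task):

# check_connection.py - hypothetical check that the Airflow connection can reach GCS
from airflow.providers.google.cloud.hooks.gcs import GCSHook

hook = GCSHook(gcp_conn_id="google_cloud_default")
print(hook.list("my-etl-bucket"))  # lists objects in the (placeholder) bucket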
Google Cloud Composer is a managed workflow orchestration service built on Apache Airflow, an open-source platform designed for orchestrating complex workflows. At its core, Composer simplifies scheduling, managing, and monitoring workflows and data pipelines. Here are the steps to deploy Airflow DAGs on Cloud Composer (a sketch for uploading DAG files follows the list):
- Activate the Cloud Composer API on Google Cloud Platform
- Choose the environment version; this tutorial uses Composer 3
- Set up the network & advanced configuration if necessary and create the environment
- Wait until the environment is created
- Set up the Google Cloud connection from the Airflow web UI, as described in Setting Up Connection above (on Composer there is no local Docker container to mount a volume into, so paste the key JSON into the Keyfile JSON field instead of using a Keyfile Path)
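Once the environment is ready, deploying a DAG means copying the DAG file into the environment's DAGs bucket (shown on the Composer environment details page). A minimal sketch, assuming the service account key is gcp.json and using placeholder bucket and file names:

# deploy_dag.py - hypothetical upload of a DAG file to the Composer DAGs bucket
from google.cloud import storage

client = storage.Client.from_service_account_json("gcp.json")
bucket = client.bucket("us-central1-my-composer-env-bucket")  # placeholder bucket name
bucket.blob("dags/simple_etl.py").upload_from_filename("dags/simple_etl.py")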