The purpose of this project is to apply the knowledge gained from the Snowflake data engineering badge. For this project I will create a data pipeline to retrieve the production of biomethane in France. Every day, the French government publishes on data.gouv a CSV file with the production of biomethane in France by date, producer and area (region). The CSV published each day contains the full history plus the production of the previous day.
As stated in the purpose above, we first retrieve the latest CSV file through the link provided by data.gouv. We use a Google Cloud Function that runs a Python program; this program loads the CSV into Google Cloud Storage. The function is triggered every night at 23:00 by a Google Cloud Scheduler job, and I also implemented a CI/CD pipeline between this GitHub repo and the Python file using Google Cloud Build. Every time a file is loaded into the bucket, a message is sent to a Pub/Sub topic. Snowflake is a subscriber to this topic, so every message triggers a pipe that ingests the latest CSV into the stage table. Then, every twelve hours (as of April 2024 this step cannot be triggered by an event coming out of the Snowpipe load), we merge the latest raw data into the prod table and compute the metrics.
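Since the twelve-hour merge is not detailed further below, here is a rough sketch of what it could look like as a scheduled Snowflake task created from Python with snowflake-connector-python; every object, column and warehouse name in it is an assumption, not necessarily what the project uses:

```python
# Minimal sketch: a Snowflake task that merges the staged rows into the prod
# table every twelve hours. All object/column/warehouse names are assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="COMPUTE_WH", database="BIOMETHANE", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("""
CREATE OR REPLACE TASK MERGE_BIOMETHANE
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = '720 MINUTE'   -- every twelve hours
AS
  MERGE INTO PROD_BIOMETHANE p
  USING STAGE_BIOMETHANE s
    ON p.PRODUCTION_DATE = s.PRODUCTION_DATE AND p.PRODUCER = s.PRODUCER
  WHEN MATCHED THEN UPDATE SET p.PRODUCTION = s.PRODUCTION
  WHEN NOT MATCHED THEN
    INSERT (PRODUCTION_DATE, PRODUCER, REGION, PRODUCTION)
    VALUES (s.PRODUCTION_DATE, s.PRODUCER, s.REGION, s.PRODUCTION)
""")
cur.execute("ALTER TASK MERGE_BIOMETHANE RESUME")  # tasks are created suspended
conn.close()
```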
First, in order to implement this project you need:
- A Snowflake account
- A GCP account with billing enabled
- The Pub/Sub, GCS, Cloud Functions and Cloud Scheduler APIs enabled
- Your GitHub account linked with Google Cloud Build
- The Google Cloud SDK installed on your local machine or VM
There are two ways to create a bucket:
- Using the CLI
- Using the console
Important: you need to create a folder inside your bucket!
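As an illustration, the bucket and its folder can also be created from Python with the google-cloud-storage client; the project id, bucket name, location and folder name below are placeholders, not the ones used in this project:

```python
# Minimal sketch: create the bucket and a "folder" inside it.
# Project id, bucket name, location and folder name are assumptions.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")
bucket = client.create_bucket("biomethane-raw-data", location="EU")

# GCS has no real folders: uploading an empty object whose name ends with "/"
# makes the folder visible in the console and usable as a prefix.
bucket.blob("daily_csv/").upload_from_string("")
print(f"Created bucket {bucket.name} with folder daily_csv/")
```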
Develop the Python program with a main function that takes one parameter (request). The program needs to be named main.py (you can name it whatever you want, but you then need to modify the gcloud command accordingly).
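Here is a minimal sketch of what such a main.py could look like; the download URL, the bucket name and the folder are assumptions to replace with your own values:

```python
# Minimal sketch of main.py: download the latest CSV from data.gouv and load
# it into the GCS bucket. URL, bucket and folder names are assumptions.
import datetime
import os

import requests
from google.cloud import storage

CSV_URL = "https://www.data.gouv.fr/fr/datasets/r/<resource-id>"   # assumed stable link
BUCKET_NAME = os.environ.get("BUCKET_NAME", "biomethane-raw-data")  # assumed env var


def main(request):
    """HTTP entry point called by Cloud Scheduler through the function URL."""
    # Download the latest CSV published by data.gouv
    response = requests.get(CSV_URL, timeout=60)
    response.raise_for_status()

    # Load it into the folder of the bucket, one file per day
    blob_name = f"daily_csv/biomethane_{datetime.date.today():%Y%m%d}.csv"
    storage.Client().bucket(BUCKET_NAME).blob(blob_name).upload_from_string(
        response.content, content_type="text/csv"
    )
    return f"Loaded {blob_name}", 200
```

The requests and google-cloud-storage packages have to be listed in the requirements.txt deployed alongside the function.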
Then deploy it using these commands from the local repository:

gcloud functions deploy "name_function" --runtime python39 --trigger-http --allow-unauthenticated --set-env-vars GOOGLE_CLOUD_PROJECT=[name_secret] # we set an environment secret for this project

gcloud scheduler jobs create http "name_scheduler" \
  --schedule "0 23 * * *" \
  --http-method GET \
  --uri url_function \
  --oidc-service-account-email "service-account-email"
You need to follow the documentation at this link to configure the storage integration between Snowflake and GCS:
https://docs.snowflake.com/en/user-guide/data-load-gcs-config
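That page essentially walks you through creating a storage integration and an external stage on the bucket. As an illustration only (the names and the gcs:// path are placeholders), the statements can be run from Python with the Snowflake connector, or pasted directly into a worksheet:

```python
# Sketch of the storage integration and external stage from the linked
# documentation. Object names and the gcs:// path are assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    role="ACCOUNTADMIN", database="BIOMETHANE", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("""
CREATE STORAGE INTEGRATION IF NOT EXISTS GCS_INT
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://biomethane-raw-data/daily_csv/')
""")

# DESC shows the service account Snowflake creates for you; it is this account
# that must be granted access to the bucket (see the documentation above).
cur.execute("DESC STORAGE INTEGRATION GCS_INT")
print(cur.fetchall())

cur.execute("""
CREATE STAGE IF NOT EXISTS GCS_STAGE
  URL = 'gcs://biomethane-raw-data/daily_csv/'
  STORAGE_INTEGRATION = GCS_INT
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)  -- adjust to the actual CSV format
""")
conn.close()
```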
Next, to set up the auto-ingest Snowpipe, follow the documentation here:
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-gcs
However, this documentation has two problems: first, you need to grant the Snowflake service account the Monitoring Viewer role in IAM; second, the documentation puts quotes around the name, which triggers an error, as explained in my Stack Overflow post.
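For reference, here is a hedged sketch of the notification integration and the pipe that this documentation describes, again run from Python; the integration, stage and table names as well as the Pub/Sub subscription path are placeholders, and the quoting of the names should be adjusted according to the problem noted above:

```python
# Sketch of the auto-ingest setup: a notification integration on the Pub/Sub
# subscription plus a pipe that copies each new CSV into the stage table.
# All names and the subscription path are assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    role="ACCOUNTADMIN", database="BIOMETHANE", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("""
CREATE NOTIFICATION INTEGRATION IF NOT EXISTS GCS_NOTIF_INT
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = GCP_PUBSUB
  ENABLED = TRUE
  GCP_PUBSUB_SUBSCRIPTION_NAME = 'projects/my-gcp-project/subscriptions/snowpipe-sub'
""")

cur.execute("""
CREATE PIPE IF NOT EXISTS BIOMETHANE_PIPE
  AUTO_INGEST = TRUE
  INTEGRATION = 'GCS_NOTIF_INT'
AS
  COPY INTO STAGE_BIOMETHANE
  FROM @GCS_STAGE
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
conn.close()
```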
This video explains well how to set up the CI/CD between the GitHub repo and the Cloud Function:
https://www.youtube.com/watch?v=TeguiTFprFo&pp=ygUhQ0kvQ0QgZ2l0aHViIGNsb3VkIGZuY3Rpb24gcHl0aG9u
Don't forget to add a .gcloudignore file to exclude all the files that are not useful for deployment (you can copy my file).
Finally, follow the new Snowflake documentation on Git integration:
https://docs.snowflake.com/en/developer-guide/git/git-overview
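That documentation boils down to creating an API integration towards GitHub and a Git repository object pointing at this repo. A rough, hedged sketch, where the GitHub account and repository URL are placeholders:

```python
# Sketch of the Git integration described in the linked documentation.
# The allowed prefix and the repository URL are assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    role="ACCOUNTADMIN", database="BIOMETHANE", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("""
CREATE OR REPLACE API INTEGRATION GITHUB_API_INT
  API_PROVIDER = GIT_HTTPS_API
  API_ALLOWED_PREFIXES = ('https://github.com/my-account')
  ENABLED = TRUE
""")

cur.execute("""
CREATE OR REPLACE GIT REPOSITORY BIOMETHANE_REPO
  API_INTEGRATION = GITHUB_API_INT
  ORIGIN = 'https://github.com/my-account/my-repo.git'
""")
conn.close()
```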