"Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS."
This repository demonstrates a simple ELT process that uses Delta Lake to perform upserts (merge) and to vacuum, i.e. remove data files that are no longer referenced by the latest state of the table's transaction log.
- `1.raw-zone-ingestion` - initial ingestion into the raw zone
- `2.raw-zone-incremental` - incremental ingestion (append) into the raw zone
- `3.staging-zone-incremental` - snapshot of the latest state of the table and creation of the staging zone (Delta)
- `4.staging-zone-incremental` - incremental snapshot ingestion (Delta)
- Check scripts (`check_raw-zone.py`, `check_staging-zone.py`) - scripts that read and monitor the tables as they are created
- CSV files (`titanic.csv`, `titanic2.csv`, `titanic3.csv`) - simulate changes to the tables being ingested
- Directories (`raw-zone`, `staging-zone`) - store the data
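The staging-zone steps boil down to an upsert (MERGE) followed by a vacuum. The sketch below models that logic in plain Python rather than the Delta Lake API, so the `PassengerId` key and the example rows are illustrative assumptions, not code from this repository:

```python
# Plain-Python sketch of MERGE (upsert) and VACUUM semantics.
# Not the Delta Lake API; key name and rows are made up for illustration.

def upsert(target, updates, key):
    """MERGE semantics: update rows whose key matches, insert the rest."""
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row  # matched -> update, unmatched -> insert
    return list(merged.values())

def vacuum(all_files, live_files):
    """VACUUM semantics: keep only the data files still referenced by
    the latest state of the transaction log."""
    live = set(live_files)
    return [f for f in all_files if f in live]

# Incremental load: rows from a newer CSV arriving after the first one
staging = [{"PassengerId": 1, "Survived": 0}, {"PassengerId": 2, "Survived": 1}]
incoming = [{"PassengerId": 2, "Survived": 0}, {"PassengerId": 3, "Survived": 1}]

staging = upsert(staging, incoming, "PassengerId")
print(sorted(r["PassengerId"] for r in staging))  # → [1, 2, 3]
```

In the actual scripts, the equivalent operations in the delta-spark Python API are `DeltaTable.merge(...)` with `whenMatchedUpdateAll()` / `whenNotMatchedInsertAll()`, and `DeltaTable.vacuum()` for file cleanup.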