Added 81 questions
Added questions to ask for platform & pipeline design
team-data-science committed Dec 11, 2024
1 parent d863461 commit ed4d04e
Showing 3 changed files with 135 additions and 0 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -138,6 +138,9 @@ Find the change log with all recent updates here: [SEE UPDATES](sections/10-Updates.md)
- [Scaling Up](sections/03-AdvancedSkills.md#scaling-up)
- [Scaling Out](sections/03-AdvancedSkills.md#scaling-out)
- [When not to Do Big Data](sections/03-AdvancedSkills.md#please-dont-go-big-data)
- [Platform & Pipeline Design basics](sections/03-AdvancedSkills.md#platform-and-pipeline-design-basics)
- [Data Source Questions](sections/03-AdvancedSkills.md#data-source-questions)
- [Goals and Destination Questions](sections/03-AdvancedSkills.md#goals-and-destination-questions)
- [Connect](sections/03-AdvancedSkills.md#connect)
- [REST APIs](sections/03-AdvancedSkills.md#rest-apis)
- [API Design](sections/03-AdvancedSkills.md#api-design)
129 changes: 129 additions & 0 deletions sections/03-AdvancedSkills.md
@@ -14,6 +14,9 @@ Advanced Data Engineering Skills
- [Scaling Up](03-AdvancedSkills.md#scaling-up)
- [Scaling Out](03-AdvancedSkills.md#scaling-out)
- [When not to Do Big Data](03-AdvancedSkills.md#please-dont-go-big-data)
- [Platform & Pipeline Design basics](03-AdvancedSkills.md#platform-and-pipeline-design-basics)
- [Data Source Questions](03-AdvancedSkills.md#data-source-questions)
- [Goals and Destination Questions](03-AdvancedSkills.md#goals-and-destination-questions)
- [Connect](03-AdvancedSkills.md#connect)
- [REST APIs](03-AdvancedSkills.md#rest-apis)
- [API Design](03-AdvancedSkills.md#api-design)
@@ -336,6 +339,132 @@ If you don't need it, it's making absolutely no sense at all!
On the other hand: if you really need big data tools, they will save your ass :)

## Platform and Pipeline Design Basics
Many people ask: "How do you select the platform and the tools, and how do you design the pipelines?"
The options seem infinite. Technology, however, should never dictate the decisions.

Here are 81 questions you should answer when starting a project.


### Data Source Questions
(Comprehensive Questions for Data Engineers)

#### Data Origin and Structure
- **What is the source?** Understand the "device."
- **What is the format of the incoming data?** (e.g., JSON, CSV, Avro, Parquet)
- **What’s the schema?**
- **Is the data structured, semi-structured, or unstructured?**
- **What is the data type?** Understand the content of the data.
- **Is the schema well-defined, or is it dynamic?**
- **How are changes in the data structure from the source (schema evolution) handled?**

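To make the schema questions above concrete, a minimal sketch of validating each incoming record against an explicit schema. The event shape and field names are assumptions; this uses the `jsonschema` package, but any schema registry or validation layer serves the same purpose.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical schema for an incoming sensor event -- adjust to your source.
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "device_id": {"type": "string"},
        "timestamp": {"type": "string"},
        "temperature": {"type": "number"},
    },
    "required": ["device_id", "timestamp"],
    "additionalProperties": True,  # tolerate schema evolution at the source
}

def is_valid(record: dict) -> bool:
    """Return True if the record matches the expected schema."""
    try:
        validate(instance=record, schema=EVENT_SCHEMA)
        return True
    except ValidationError:
        return False

print(is_valid({"device_id": "d-42", "timestamp": "2024-12-11T10:00:00Z"}))  # True
```
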
#### Data Volume & Velocity
- **How much data is transmitted per transmission?**
- **How fast is the data coming in?** (e.g., messages per minute)
- **What is the maximum data volume expected per source per day?**
- **What scaling of sources/data is expected?**
- **Are there peaks for incoming data?**
- **How much data is posted per day across all sources?**
- **How does the data volume fluctuate?** (e.g., seasonal peaks, hourly/daily variations)
- **How will the system handle bursts of data?** (e.g., throttling or buffering)

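A quick back-of-envelope calculation usually answers the volume and velocity questions above before any tool is chosen. A sketch with made-up numbers; replace the rates, sizes, and peak factor with your own measurements.

```python
# Assumed figures -- placeholders, not measurements.
sources = 500                 # number of devices/sources
msgs_per_min_per_source = 20  # average message rate per source
avg_msg_size_bytes = 1_200    # average payload size
peak_factor = 3               # expected burst multiplier

daily_msgs = sources * msgs_per_min_per_source * 60 * 24
daily_gib = daily_msgs * avg_msg_size_bytes / 1024**3
peak_msgs_per_sec = sources * msgs_per_min_per_source / 60 * peak_factor

print(f"~{daily_msgs:,} messages/day, ~{daily_gib:.1f} GiB/day, "
      f"peak ~{peak_msgs_per_sec:,.0f} messages/s")
```
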
#### Source Reliability & Redundancy
- **Is there data arriving late?**
- **Is there a risk of duplicate data from the source?** How will we handle de-duplication?
- **How reliable are the sources?** What’s the expected failure rate?
- **How do we handle data corruption or loss during transmission?**
- **What happens if a source goes offline?** Is there a fallback or failover source?
- **Do we need to retry failed transmissions or have fault-tolerance mechanisms in place?**

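One common answer to the duplicate and late-data questions above is idempotent ingestion keyed on an event ID. A minimal in-memory sketch; in a real pipeline the seen-set would live in a durable store, and the six-hour lateness tolerance is an assumption.

```python
from datetime import datetime, timedelta, timezone

seen_ids: set[str] = set()          # use a durable key-value store in production
MAX_LATENESS = timedelta(hours=6)   # assumed tolerance for late events

def accept(event: dict) -> bool:
    """Drop duplicates and flag events that arrive too late."""
    if event["event_id"] in seen_ids:
        return False                                # duplicate -> skip
    seen_ids.add(event["event_id"])
    # timestamps are assumed to be ISO-8601 with an explicit UTC offset
    event_time = datetime.fromisoformat(event["timestamp"])
    if datetime.now(timezone.utc) - event_time > MAX_LATENESS:
        print(f"late event: {event['event_id']}")   # route to a late-data path
    return True
```
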
#### Data Extraction & New Sources
- **Do we need to extract the data from the sources?**
- **How many sources are there?**
- **Will new sources be implemented?**

#### Data Source Connectivity & Authentication
- **How is the data arriving?** (API, bucket, etc.)
- **How is the authentication done?**
- **What kind of connection is required for the data source?** (e.g., streaming, batch, API)
- **What protocols are used for data ingestion?** (e.g., REST, WebSocket, FTP)
- **Are there any rate limits or quotas imposed by the data source?**
- **How do we handle credentials?** Is there an API?
- **What is the retry strategy if data fails to be processed or transmitted?**

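For the connectivity questions above, a common pattern is pull-based ingestion over REST with exponential backoff on failures and rate limits. A hedged sketch using the `requests` library; the endpoint and token are placeholders, and credentials should come from a secrets manager.

```python
import time
import requests

API_URL = "https://example.com/api/v1/events"   # placeholder endpoint
TOKEN = "..."                                    # load from a secrets manager, never hard-code

def fetch_events(max_retries: int = 5) -> list[dict]:
    """Pull one batch of events, retrying with exponential backoff."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 429:              # rate limited -> respect Retry-After
            delay = float(resp.headers.get("Retry-After", delay))
        time.sleep(delay)
        delay *= 2                               # exponential backoff
    raise RuntimeError("source unreachable after retries")
```
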
#### Data Security & Compliance
- **Does the data need to be encrypted at the source before being transmitted?**
- **Are there any compliance frameworks (e.g., GDPR, HIPAA) that the source data must adhere to?**
- **Is there a requirement for data masking or obfuscation at the source?**

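If masking or obfuscation at the source is required, one simple approach is hashing direct identifiers before transmission. A sketch in which the sensitive field names are assumptions; compliance teams will usually dictate the exact technique (hashing, tokenization, or full removal).

```python
import hashlib

PII_FIELDS = {"email", "customer_name"}   # assumed sensitive fields

def mask(record: dict, salt: str) -> dict:
    """Replace PII values with salted SHA-256 digests before the record leaves the source."""
    masked = dict(record)
    for field in PII_FIELDS & record.keys():
        masked[field] = hashlib.sha256((salt + str(record[field])).encode()).hexdigest()
    return masked
```
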
#### Metadata & Audit
- **Is there metadata for the client transmission stored somewhere?**
- **What metadata should be captured for each transmission?** (e.g., record counts, latency)
- **How do we track and log data ingestion events for audit purposes?**
- **Is there a need for tracking data lineage?** (i.e., source origin and changes over time)

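To answer the metadata and audit questions above, each transmission can emit a small audit record alongside the data. A sketch of what such a record could capture; the field names are illustrative and the output would normally go to a log or metadata store rather than stdout.

```python
import json
from datetime import datetime, timezone

def audit_record(source_id: str, record_count: int, latency_ms: float, batch_id: str) -> str:
    """Build a JSON audit entry for one transmission."""
    return json.dumps({
        "batch_id": batch_id,
        "source_id": source_id,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "record_count": record_count,
        "latency_ms": latency_ms,
    })

print(audit_record("sensor-fleet-eu", 14_500, 230.0, "2024-12-11-0001"))
```
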
---

### Goals and Destination Questions
(Comprehensive Questions for Data Engineers)

#### Use Case & Data Consumption
- **What kind of use case is this?** (Analytics, BI, ML, Transactional processing, Visualization, User Interfaces, APIs)
- **What are the typical use cases that require this data?** (e.g., predictive analytics, operational dashboards)
- **What are the downstream systems or platforms that will consume this data?**
- **How critical is real-time data versus historical data in this use case?**

#### Data Query & Delivery
- **How is the data visualized?** (raw data, aggregated data)
- **How much raw data is processed at once?**
- **How much data is cold data, and how often is cold data queried?**
- **How fast do the results need to appear?**
- **How much data is going to be queried at once?**
- **How fresh does the data need to be?**
- **How often is the data queried?** (frequency)
- **What are the SLAs for delivering data to downstream systems or applications?**

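Freshness and SLA questions like the ones above can be turned into a concrete check: compare the newest ingested timestamp with the current time and alert when the lag exceeds the agreed threshold. A minimal sketch; the 15-minute SLA is an assumption.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(minutes=15)   # assumed agreement with data consumers

def check_freshness(latest_ingested: datetime) -> bool:
    """Return True while the newest data is within the freshness SLA."""
    lag = datetime.now(timezone.utc) - latest_ingested
    if lag > FRESHNESS_SLA:
        print(f"SLA breach: data is {lag} behind")   # page the on-call / open an incident
        return False
    return True
```
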
#### Aggregation & Modeling
- **How is the data aggregated?** (by device, topic, time)
- **When does the aggregation happen?** (on query, on schedule, while streaming)
- **What kind of data models are needed for this use case?** (e.g., star schema, snowflake schema)
- **Is there a need for pre-aggregations to speed up queries?**
- **Should partitioning or indexing strategies be implemented to optimize query performance?**

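Pre-aggregation is one way to answer the aggregation questions above: roll raw events up on a schedule so dashboards never scan raw data. A sketch using pandas; the column names and the device/hour grain are assumptions.

```python
import pandas as pd

def hourly_rollup(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw events into one row per device and hour."""
    events["hour"] = pd.to_datetime(events["timestamp"]).dt.floor("h")
    return (
        events.groupby(["device_id", "hour"])
        .agg(readings=("temperature", "count"), avg_temp=("temperature", "mean"))
        .reset_index()
    )
```
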
#### Performance & Availability
- **What is the processing time requirement?**
- **What is the availability of analytics output?** (input vs output delay)
- **How fresh does the data need to be?**
- **What are the performance expectations for query speed?**
- **What is the acceptable query response time for end-users?**
- **How will the system handle an increase in concurrent queries from multiple users?**
- **What is the expected lag between data ingestion and availability for querying?**
- **Do we need horizontal scaling for query engines or databases?**

#### Data Lifecycle & Retention
- **What’s the data retention time?**
- **How often is data archived or moved to lower-cost storage?**
- **Will old data need to be transformed or reprocessed for new use cases?**
- **What are the data retention policies?** (e.g., hot vs cold storage)
- **How will the use case evolve as the data grows?** Will this affect how data is consumed or visualized?

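The retention questions above often end up as a simple scheduled job: anything older than the hot-retention window moves to cheaper storage or is deleted. A plain-Python sketch over date-partitioned folders; the paths, partition naming, and 90-day window are assumptions.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path

HOT_PATH = Path("/data/hot")        # placeholder locations
COLD_PATH = Path("/data/cold")
HOT_RETENTION = timedelta(days=90)  # assumed policy

def archive_old_partitions() -> None:
    """Move date-partitioned folders (YYYY-MM-DD) older than the retention window to cold storage."""
    cutoff = datetime.now() - HOT_RETENTION
    for partition in HOT_PATH.iterdir():
        if datetime.strptime(partition.name, "%Y-%m-%d") < cutoff:
            shutil.move(str(partition), str(COLD_PATH / partition.name))
```
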
#### Monitoring & Debugging
- **How will data delivery to the destination be monitored?** (e.g., time-to-load, query failures)
- **How will we monitor data pipeline health at the destination?** (e.g., throughput, latency)
- **What tools or methods will be used for debugging data delivery failures or performance bottlenecks?**
- **What metrics should be tracked to ensure data pipeline health?** (e.g., latency, throughput)
- **How do we handle issues such as data corruption or incomplete data at the destination?**

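The monitoring questions above typically reduce to a handful of metrics per pipeline run. A minimal sketch of collecting and checking them; the 1% failure threshold is an assumption, and in practice the metrics would go to a monitoring system such as Prometheus or CloudWatch.

```python
import time

def run_with_metrics(batch: list[dict], load_fn) -> dict:
    """Load one batch and report throughput, latency, and failure counts."""
    start = time.monotonic()
    failures = 0
    for record in batch:
        try:
            load_fn(record)
        except Exception:
            failures += 1
    elapsed = time.monotonic() - start
    metrics = {
        "records": len(batch),
        "failures": failures,
        "seconds": round(elapsed, 2),
        "records_per_second": round(len(batch) / elapsed, 1) if elapsed else None,
    }
    if failures / max(len(batch), 1) > 0.01:     # assumed 1% failure threshold
        print(f"ALERT: high failure rate {metrics}")
    return metrics
```
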
#### Data Access & Permissions
- **Who is working with the platform, and who has access to query or visualize the data?**
- **Which tools are used to query the data?**
- **What kind of data export capabilities are required?** (e.g., CSV, API, direct database access)
- **Is role-based access control (RBAC) needed to segment data views for different users?**
- **How will access to sensitive data be managed?** (e.g., row-level security, encryption)

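Role-based access control from the questions above can start as a simple mapping from role to the columns a user may see. A toy sketch; the roles and column names are assumptions, and real deployments would rely on the warehouse's built-in RBAC or row-level security instead of application-side filtering.

```python
ROLE_COLUMNS = {
    "analyst": {"device_id", "hour", "avg_temp"},
    "admin": {"device_id", "hour", "avg_temp", "customer_name", "email"},
}

def filter_columns(rows: list[dict], role: str) -> list[dict]:
    """Strip columns the given role is not allowed to see."""
    allowed = ROLE_COLUMNS.get(role, set())
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]
```
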
#### Scaling & Future Requirements
- **What are the scalability requirements for the data platform as data volume grows?**
- **How will future business goals or scalability needs affect the design of data aggregation and retention strategies?**
- **How will the system handle an increasing load as more users query data or as data volume grows?**


## Connect

Expand Down
3 changes: 3 additions & 0 deletions sections/10-Updates.md
@@ -2,6 +2,9 @@ Updates
============

What's new? Here you can find a list of all the updates with links to the sections.
- **2024-12-11**
  - Prepared the most important questions for platform & pipeline design, specifically looking at the data sources and the goals [click here](03-AdvancedSkills.md#platform-and-pipeline-design-basics)


- **2024-11-28**
  - Prepared a GenAI RAG example project that you can run on your own computer without internet. It uses Ollama with the Mistral model and Elasticsearch. Working on a way of creating embeddings from PDF files and inserting them into Elasticsearch for queries [click here](04-HandsOnCourse.md#genai-retrieval-augmented-generation-with-ollama-and-elasticsearch)
