Update documentation (#25)
+ some refactoring
+ some improvements

 Changes to be committed:
	modified: .gitignore
	modified: README.md
	renamed: docs/concepts/concepts.md
	new file: docs/concepts/usecases.md
	new file: docs/gen_ref_pages.py
	modified: docs/index.md
	renamed: docs/notebooks/basic_example.ipynb
	new file: docs/notebooks/using_where.ipynb
	new file: docs/python/main_structures.md
	modified: mkdocs.yml
	new file: static/protobuf/protobuf-diagram.svg
	new file: static/protobuf/protobuf-diagram.uml
	modified: tsumugi-python/poetry.lock
	modified: tsumugi-python/pyproject.toml
	modified: tsumugi-python/tsumugi/__init__.py
	modified: tsumugi-python/tsumugi/analyzers.py
	modified: tsumugi-python/tsumugi/anomaly_detection.py
	modified: tsumugi-python/tsumugi/checks.py
	deleted: tsumugi-python/tsumugi/examples/base_example.py
	deleted: tsumugi-python/tsumugi/examples/base_example_classic.py
	modified: tsumugi-python/tsumugi/verification.py
SemyonSinchenko authored Sep 20, 2024
1 parent 787a5ca commit 269a77b
Showing 21 changed files with 1,480 additions and 341 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -227,3 +227,6 @@ dev/spark-*

# Spark
tmp/*

# Auto-generated
docs/python/reference/*
6 changes: 6 additions & 0 deletions README.md
@@ -118,4 +118,10 @@ poetry env use python3.10 # 3.10+
poetry install
```

## References

Tsumugi is built on top of the Deequ data quality tool:

- _Schelter, Sebastian, et al. "Automating large-scale data quality verification." Proceedings of the VLDB Endowment 11.12 (2018): 1781-1794._, [link](https://www.amazon.science/publications/automating-large-scale-data-quality-verification?ref=https://githubhelp.com);
- _Schelter, Sebastian, et al. "Unit testing data with deequ." Proceedings of the 2019 International Conference on Management of Data. 2019._, [link](https://www.amazon.science/publications/unit-testing-data-with-deequ);
- _Schelter, Sebastian, et al. "Deequ-data quality validation for machine learning pipelines." (2018)._, [link](https://www.amazon.science/publications/deequ-data-quality-validation-for-machine-learning-pipelines);
1 change: 1 addition & 0 deletions docs/concepts.md → docs/concepts/concepts.md
@@ -4,6 +4,7 @@ Since the project is primarily about creating a wrapper, it adheres to the same

You can read more about that in the publications by the authors.

- _Schelter, Sebastian, et al. "Automating large-scale data quality verification." Proceedings of the VLDB Endowment 11.12 (2018): 1781-1794._, [link](https://www.amazon.science/publications/automating-large-scale-data-quality-verification?ref=https://githubhelp.com);
- _Schelter, Sebastian, et al. "Unit testing data with deequ." Proceedings of the 2019 International Conference on Management of Data. 2019._, [link](https://www.amazon.science/publications/unit-testing-data-with-deequ);
- _Schelter, Sebastian, et al. "Deequ-data quality validation for machine learning pipelines." (2018)._, [link](https://www.amazon.science/publications/deequ-data-quality-validation-for-machine-learning-pipelines);

32 changes: 32 additions & 0 deletions docs/concepts/usecases.md
@@ -0,0 +1,32 @@
# Use cases for Deequ

Compared to other data quality tools, which primarily follow zero-code or low-code paradigms, Deequ is a code-first solution that provides a well-designed and stable programming API. This approach makes Deequ highly flexible: for example, you can design your own YAML-like low-code API with a structure that fits your specific domain. In essence, Deequ functions more as a data quality engine that can be adapted to a wide range of use cases. Some of these use cases are described below.

## Data Profiling

The first and most obvious use case for Deequ is data profiling. In this scenario, one does not need to use `Check`, `Constraint`, or `AnomalyDetection` features. It would be sufficient to simply add all the analyzers to the `required_analyzers` section of the `VerificationSuite`. This approach would not produce any check results or row-level results, but instead generate a simple table with computed metrics per instance (which may be a `Column` or `Dataset`).

As a result, you will obtain a list of metrics and their corresponding values. Since both Deequ and Tsumugi are code-first (rather than YAML-first) frameworks, it will be very easy to customize your profiling based on the data types in your `DataFrame`.
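
As a rough illustration, the sketch below wires a handful of analyzers into the `required_analyzers` part of a `VerificationSuite`. Only `VerificationSuite` and `required_analyzers` come from the description above; the import paths, builder methods, and analyzer constructors are assumed names for the sake of the example and may differ from the actual tsumugi API (see the auto-generated reference for the exact signatures).

```python
# Profiling-only sketch: no checks, just required analyzers.
# NOTE: import paths, builder methods and analyzer constructors below are
# assumptions for illustration, not confirmed tsumugi signatures.
from tsumugi.analyzers import Completeness, Maximum, Mean, Minimum, Size
from tsumugi.verification import VerificationSuite

suite = (
    VerificationSuite()
    .on_data(df)  # df is a PySpark (Connect or Classic) DataFrame
    .add_required_analyzers(
        [
            Size(),                    # Dataset-level metric
            Completeness("user_id"),   # Column-level metrics
            Mean("purchase_amount"),
            Minimum("purchase_amount"),
            Maximum("purchase_amount"),
        ]
    )
)

# Without any Check attached, the run only returns the computed metrics:
# one row per (entity, instance, metric name, value).
result = suite.run()
result.metrics.show()
```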

## Static constraints based on business rules

The next potential use case for Deequ is to add static constraints to your tables and use row-level results to quarantine data that fails to meet these constraints. For example, if you have a string column in one of your tables that should always contain exactly 14 characters, such as a mobile phone number, you can add a constraint specifying that both `MaxLength` and `MinLength` should be exactly 14. You can then use row-level results to identify which rows passed the constraint and which did not. These row-level results will contain your data along with one boolean column for each `Check`, indicating whether the row passed all the constraints in that `Check` or not. Another good option might be the `PatternMatch` analyzer, which could be used to check if a column contains a valid email address and quarantine the row if it doesn't.
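
A sketch of such a check is shown below. `MinLength` and `MaxLength` are the analyzers mentioned above; the `Check` constructor, the constraint methods, and the way row-level results are exposed are assumed shapes for illustration rather than the confirmed tsumugi API.

```python
# Sketch of a static business-rule check with row-level results.
# NOTE: constructor and method names below are illustrative assumptions.
import pyspark.sql.functions as F

from tsumugi.checks import Check, CheckLevel
from tsumugi.verification import VerificationSuite

phone_check = (
    Check(CheckLevel.Error, "phone_number_format")
    .has_min_length("phone_number", lambda length: length == 14.0)
    .has_max_length("phone_number", lambda length: length == 14.0)
)

result = (
    VerificationSuite()
    .on_data(df)
    .add_check(phone_check)
    .run()
)

# Row-level results contain the original columns plus one boolean column
# per Check; filter on that column to split passing and failing rows.
rows = result.row_level_results
good_rows = rows.filter(F.col("phone_number_format"))
quarantined = rows.filter(~F.col("phone_number_format"))
```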

## Detecting data drift for ML inference

Another excellent use case for Deequ is as a data drift detector that checks the input of ML model batch inference. Imagine we have an ML-based recommender system that re-computes recommendations for our users once a day, for the following day. This is a typical batch-inference scenario: the ML model is trained once on training data and then run offline each day on new data. For such a system, it is crucial to ensure that the data has not changed significantly compared to previous batches. If it has, that is a signal that our ML model should be retrained on more recent training data.

As we can see, there are no static constraints here. Rather than fitting our data into strict boundaries, we aim to ensure that data drift remains within acceptable limits.

This scenario presents a perfect use case for Deequ Anomaly Detection. Let's imagine we have an ML model trained on the following features:

1. Duration of customer relationship (numeric)
2. Paid subscription status (boolean, can be NULL)
3. Frequency of service usage (numeric)

In this case, we can apply the following Anomaly Detection checks:

- Assuming that the average, minimum, and maximum frequency of service usage should not change dramatically, we can apply a `RelativeRateOfChange` strategy. By setting maximum increase and minimum decrease values to 1.1 and 0.9 respectively, we allow for a ±10% drift. Any new batch that shows significant changes compared to the previous batch of data will be considered an anomaly in this case.
- Because our model uses a missing value imputation strategy to fill NULLs in a flag column, we need to ensure that the share of NULLs stays close to what it was in the data on which we trained our ML model. For this case, a `SimpleThresholdStrategy` is a good choice: we can set maximum and minimum allowed drift limits, and any data that falls within this range will be considered acceptable.
- Regarding the frequency of service usage, we know that the values should approximate a Normal Distribution. This means we can apply the `BatchNormalStrategy` to our batch intervals and ensure that the data is actually normally distributed by using thresholds on the mean and standard deviation of the metrics. A sketch of how these checks could be wired together follows below.
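
The strategy names `RelativeRateOfChange`, `SimpleThresholdStrategy`, and `BatchNormalStrategy` in the sketch below come from the list above; the analyzers, parameter names, and builder methods are assumptions made purely for illustration and may differ from the actual tsumugi API.

```python
# Sketch: mapping the three drift checks above onto anomaly-detection strategies.
# NOTE: analyzers, parameter names and builder methods are illustrative assumptions.
from tsumugi.analyzers import Completeness, Mean
from tsumugi.anomaly_detection import (
    BatchNormalStrategy,
    RelativeRateOfChange,
    SimpleThresholdStrategy,
)
from tsumugi.verification import VerificationSuite

suite = (
    VerificationSuite()
    .on_data(todays_batch)
    # 1. Allow +/-10% drift of the average usage frequency between batches;
    #    the same strategy could be attached to Minimum and Maximum as well.
    .add_anomaly_detection(
        analyzer=Mean("usage_frequency"),
        strategy=RelativeRateOfChange(max_rate_increase=1.1, max_rate_decrease=0.9),
    )
    # 2. The share of non-NULL values in the subscription flag must stay
    #    within fixed bounds, close to what the model was trained on.
    .add_anomaly_detection(
        analyzer=Completeness("paid_subscription"),
        strategy=SimpleThresholdStrategy(lower_bound=0.90, upper_bound=1.0),
    )
    # 3. Usage frequency should stay roughly normally distributed across
    #    batches, so its mean must not leave the historical confidence band.
    .add_anomaly_detection(
        analyzer=Mean("usage_frequency"),
        strategy=BatchNormalStrategy(lower_deviation_factor=3.0, upper_deviation_factor=3.0),
    )
)

result = suite.run()
print(result.status)  # inspect whether any anomaly check was triggered
```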

39 changes: 39 additions & 0 deletions docs/gen_ref_pages.py
@@ -0,0 +1,39 @@
"""Generate the code reference pages and navigation.
Script was taken from
https://mkdocstrings.github.io/recipes/#automatic-code-reference-pages
"""

from pathlib import Path

import mkdocs_gen_files

nav = mkdocs_gen_files.Nav()

for path in sorted(Path(".").rglob("tsumugi/**/*.py")):
    if "proto" in str(path.absolute()):
        # We do not need to expose generated code
        continue
    module_path = path.relative_to(".").with_suffix("")
    doc_path = path.relative_to(".").with_suffix(".md")
    full_doc_path = Path("python/reference", doc_path)

    parts = tuple(module_path.parts)

    if parts[-1] == "__init__":
        parts = parts[:-1]
        doc_path = doc_path.with_name("index.md")
        full_doc_path = full_doc_path.with_name("index.md")
    elif parts[-1] == "__main__":
        continue

    nav[parts] = doc_path.as_posix()

    with mkdocs_gen_files.open(full_doc_path, "w") as fd:
        ident = ".".join(parts)
        fd.write(f"::: {ident}")

    mkdocs_gen_files.set_edit_path(full_doc_path, path)

with mkdocs_gen_files.open("python/reference/SUMMARY.md", "w") as nav_file:
    nav_file.writelines(nav.build_literate_nav())
12 changes: 12 additions & 0 deletions docs/index.md
@@ -6,3 +6,15 @@ A modern PySpark Connect/Classic wrapper on top of the Deequ, a beautiful Data Q
![](https://raw.githubusercontent.com/SemyonSinchenko/tsumugi-spark/main/static/tsumugi-spark-logo.png)

**_NOTE:_** _Tsumugi Shiraui is a chimera: a hybrid of Human and Gauna. She combines the chaotic power of Gauna with Human intelligence and empathy. Like the original character from the manga "Knights of Sidonia", this project aims to combine the very powerful but hard-to-learn-and-use Deequ Scala library with the usability and simplicity of Spark Connect (PySpark Connect, Spark Connect Go, Spark Connect Rust, etc.)._

## Table of contents

- Language-agnostic concepts and use cases for Deequ
    * [Concepts of Deequ](concepts/concepts.md)
    * [Possible use cases for Deequ](concepts/usecases.md)
- PySpark Connect / Classic API
    * [Main data structures and classes](python/main_structures.md)
    * [API Docs (auto-generated)](python/reference/SUMMARY.md)
- Example notebooks
    * [Basic example](notebooks/basic_example.ipynb)
    * [Using predicates with analyzers](notebooks/using_where.ipynb)