Update documentation (#25)
+ some refactoring
+ some improvements

 Changes to be committed:
	modified: .gitignore
	modified: README.md
	renamed: docs/concepts/concepts.md
	new file: docs/concepts/usecases.md
	new file: docs/gen_ref_pages.py
	modified: docs/index.md
	renamed: docs/notebooks/basic_example.ipynb
	new file: docs/notebooks/using_where.ipynb
	new file: docs/python/main_structures.md
	modified: mkdocs.yml
	new file: static/protobuf/protobuf-diagram.svg
	new file: static/protobuf/protobuf-diagram.uml
	modified: tsumugi-python/poetry.lock
	modified: tsumugi-python/pyproject.toml
	modified: tsumugi-python/tsumugi/__init__.py
	modified: tsumugi-python/tsumugi/analyzers.py
	modified: tsumugi-python/tsumugi/anomaly_detection.py
	modified: tsumugi-python/tsumugi/checks.py
	deleted: tsumugi-python/tsumugi/examples/base_example.py
	deleted: tsumugi-python/tsumugi/examples/base_example_classic.py
	modified: tsumugi-python/tsumugi/verification.py
SemyonSinchenko authored Sep 20, 2024
1 parent 787a5ca commit 269a77b
Showing 21 changed files with 1,480 additions and 341 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -227,3 +227,6 @@ dev/spark-*

# Spark
tmp/*

# Auto-generated
docs/python/reference/*
6 changes: 6 additions & 0 deletions README.md
@@ -118,4 +118,10 @@ poetry env use python3.10 # 3.10+
poetry install
```

## References

Tsumugi is built on top of the Deequ data quality tool:

- _Schelter, Sebastian, et al. "Automating large-scale data quality verification." Proceedings of the VLDB Endowment 11.12 (2018): 1781-1794._, [link](https://www.amazon.science/publications/automating-large-scale-data-quality-verification?ref=https://githubhelp.com);
- _Schelter, Sebastian, et al. "Unit testing data with deequ." Proceedings of the 2019 International Conference on Management of Data. 2019._, [link](https://www.amazon.science/publications/unit-testing-data-with-deequ);
- _Schelter, Sebastian, et al. "Deequ-data quality validation for machine learning pipelines." (2018)._, [link](https://www.amazon.science/publications/deequ-data-quality-validation-for-machine-learning-pipelines);
1 change: 1 addition & 0 deletions docs/concepts.md → docs/concepts/concepts.md
@@ -4,6 +4,7 @@ Since the project is primarily about creating a wrapper, it adheres to the same

You can read more about that in the publications by the authors.

- _Schelter, Sebastian, et al. "Automating large-scale data quality verification." Proceedings of the VLDB Endowment 11.12 (2018): 1781-1794._, [link](https://www.amazon.science/publications/automating-large-scale-data-quality-verification?ref=https://githubhelp.com);
- _Schelter, Sebastian, et al. "Unit testing data with deequ." Proceedings of the 2019 International Conference on Management of Data. 2019._, [link](https://www.amazon.science/publications/unit-testing-data-with-deequ);
- _Schelter, Sebastian, et al. "Deequ-data quality validation for machine learning pipelines." (2018)._, [link](https://www.amazon.science/publications/deequ-data-quality-validation-for-machine-learning-pipelines);

32 changes: 32 additions & 0 deletions docs/concepts/usecases.md
@@ -0,0 +1,32 @@
# Use cases for Deequ

Compared to other data quality tools, which primarily follow zero-code or low-code paradigms, Deequ is a code-first solution that provides a well-designed and stable programming API. This approach makes Deequ highly flexible: for example, you can design your own YAML-like low-code API with a structure that fits your specific domain. In essence, Deequ functions more as a data quality engine that can be adapted to a wide range of use cases. Some of these use cases are described below.

## Data Profiling

The first and most obvious use case for Deequ is data profiling. In this scenario, one does not need to use `Check`, `Constraint`, or `AnomalyDetection` features. It would be sufficient to simply add all the analyzers to the `required_analyzers` section of the `VerificationSuite`. This approach would not produce any check results or row-level results, but instead generate a simple table with computed metrics per instance (which may be a `Column` or `Dataset`).

As a result, you will obtain a list of metrics and their corresponding values. Since both Deequ and Tsumugi are code-first (rather than YAML-first) frameworks, it will be very easy to customize your profiling based on the data types in your `DataFrame`.
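
As a rough illustration, the sketch below wires a handful of analyzers into the `required_analyzers` part of a `VerificationSuite`. Only `VerificationSuite` and `required_analyzers` come from the description above; the import paths, builder methods, and analyzer constructors are assumed names for the sake of the example and may differ from the actual tsumugi API (see the auto-generated reference for the exact signatures).

```python
# Profiling-only sketch: no checks, just required analyzers.
# NOTE: import paths, builder methods and analyzer constructors below are
# assumptions for illustration, not confirmed tsumugi signatures.
from tsumugi.analyzers import Completeness, Maximum, Mean, Minimum, Size
from tsumugi.verification import VerificationSuite

suite = (
    VerificationSuite()
    .on_data(df)  # df is a PySpark (Connect or Classic) DataFrame
    .add_required_analyzers(
        [
            Size(),                    # Dataset-level metric
            Completeness("user_id"),   # Column-level metrics
            Mean("purchase_amount"),
            Minimum("purchase_amount"),
            Maximum("purchase_amount"),
        ]
    )
)

# Without any Check attached, the run only returns the computed metrics:
# one row per (entity, instance, metric name, value).
result = suite.run()
result.metrics.show()
```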

## Static constraints based on business rules

The next potential use case for Deequ is to add static constraints to your tables and use row-level results to quarantine data that fails to meet these constraints. For example, if you have a string column in one of your tables that should always contain exactly 14 characters, such as a mobile phone number, you can add a constraint specifying that both `MaxLength` and `MinLength` should be exactly 14. You can then use row-level results to identify which rows passed the constraint and which did not. These row-level results will contain your data along with one boolean column for each `Check`, indicating whether the row passed all the constraints in that `Check` or not. Another good option might be the `PatternMatch` analyzer, which could be used to check if a column contains a valid email address and quarantine the row if it doesn't.
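
A sketch of such a check is shown below. `MinLength` and `MaxLength` are the analyzers mentioned above; the `Check` constructor, the constraint methods, and the way row-level results are exposed are assumed shapes for illustration rather than the confirmed tsumugi API.

```python
# Sketch of a static business-rule check with row-level results.
# NOTE: constructor and method names below are illustrative assumptions.
import pyspark.sql.functions as F

from tsumugi.checks import Check, CheckLevel
from tsumugi.verification import VerificationSuite

phone_check = (
    Check(CheckLevel.Error, "phone_number_format")
    .has_min_length("phone_number", lambda length: length == 14.0)
    .has_max_length("phone_number", lambda length: length == 14.0)
)

result = (
    VerificationSuite()
    .on_data(df)
    .add_check(phone_check)
    .run()
)

# Row-level results contain the original columns plus one boolean column
# per Check; filter on that column to split passing and failing rows.
rows = result.row_level_results
good_rows = rows.filter(F.col("phone_number_format"))
quarantined = rows.filter(~F.col("phone_number_format"))
```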

## Detecting data drift for ML inference

Another excellent use case for Deequ is as a data drift detector that checks the input of ML model batch inference. Imagine we have an ML-based recommender system that re-computes recommendations for our users once a day, for the following day. This is a typical batch-inference scenario: the ML model is trained once on training data and then run offline each day on new data. For such a system, it is crucial to ensure that the data has not changed significantly compared to previous batches. If it has, that is a signal that our ML model should be retrained on more recent training data.

As we can see, there are no static constraints here. Rather than fitting our data into strict boundaries, we aim to ensure that data drift remains within acceptable limits.

This scenario presents a perfect use case for Deequ Anomaly Detection. Let's imagine we have an ML model trained on the following features:

1. Duration of customer relationship (numeric)
2. Paid subscription status (boolean, can be NULL)
3. Frequency of service usage (numeric)

In this case, we can apply the following Anomaly Detection checks:

- Assuming that the average, minimum, and maximum frequency of service usage should not change dramatically, we can apply a `RelativeRateOfChange` strategy. By setting maximum increase and minimum decrease values to 1.1 and 0.9 respectively, we allow for a ±10% drift. Any new batch that shows significant changes compared to the previous batch of data will be considered an anomaly in this case.
- Because our model uses a missing value imputation strategy to fill NULLs in a flag column, we need to ensure that the share of NULLs stays close to what it was in the data on which we trained our ML model. For this case, a `SimpleThresholdStrategy` is a good choice: we can set maximum and minimum allowed drift limits, and any data that falls within this range will be considered acceptable.
- Regarding the frequency of service usage, we know that the values should approximate a Normal Distribution. This means we can apply the `BatchNormalStrategy` to our batch intervals and ensure that the data is actually normally distributed by using thresholds on the mean and standard deviation of the metrics. A sketch of how these checks could be wired together follows below.
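
The strategy names `RelativeRateOfChange`, `SimpleThresholdStrategy`, and `BatchNormalStrategy` in the sketch below come from the list above; the analyzers, parameter names, and builder methods are assumptions made purely for illustration and may differ from the actual tsumugi API.

```python
# Sketch: mapping the three drift checks above onto anomaly-detection strategies.
# NOTE: analyzers, parameter names and builder methods are illustrative assumptions.
from tsumugi.analyzers import Completeness, Mean
from tsumugi.anomaly_detection import (
    BatchNormalStrategy,
    RelativeRateOfChange,
    SimpleThresholdStrategy,
)
from tsumugi.verification import VerificationSuite

suite = (
    VerificationSuite()
    .on_data(todays_batch)
    # 1. Allow +/-10% drift of the average usage frequency between batches;
    #    the same strategy could be attached to Minimum and Maximum as well.
    .add_anomaly_detection(
        analyzer=Mean("usage_frequency"),
        strategy=RelativeRateOfChange(max_rate_increase=1.1, max_rate_decrease=0.9),
    )
    # 2. The share of non-NULL values in the subscription flag must stay
    #    within fixed bounds, close to what the model was trained on.
    .add_anomaly_detection(
        analyzer=Completeness("paid_subscription"),
        strategy=SimpleThresholdStrategy(lower_bound=0.90, upper_bound=1.0),
    )
    # 3. Usage frequency should stay roughly normally distributed across
    #    batches, so its mean must not leave the historical confidence band.
    .add_anomaly_detection(
        analyzer=Mean("usage_frequency"),
        strategy=BatchNormalStrategy(lower_deviation_factor=3.0, upper_deviation_factor=3.0),
    )
)

result = suite.run()
print(result.status)  # inspect whether any anomaly check was triggered
```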

39 changes: 39 additions & 0 deletions docs/gen_ref_pages.py
@@ -0,0 +1,39 @@
"""Generate the code reference pages and navigation.
Script was taken from
https://mkdocstrings.github.io/recipes/#automatic-code-reference-pages
"""

from pathlib import Path

import mkdocs_gen_files

nav = mkdocs_gen_files.Nav()

for path in sorted(Path(".").rglob("tsumugi/**/*.py")):
    if "proto" in str(path.absolute()):
        # We do not need to expose generated code
        continue
    module_path = path.relative_to(".").with_suffix("")
    doc_path = path.relative_to(".").with_suffix(".md")
    full_doc_path = Path("python/reference", doc_path)

    parts = tuple(module_path.parts)

    if parts[-1] == "__init__":
        parts = parts[:-1]
        doc_path = doc_path.with_name("index.md")
        full_doc_path = full_doc_path.with_name("index.md")
    elif parts[-1] == "__main__":
        continue

    nav[parts] = doc_path.as_posix()

    with mkdocs_gen_files.open(full_doc_path, "w") as fd:
        ident = ".".join(parts)
        fd.write(f"::: {ident}")

    mkdocs_gen_files.set_edit_path(full_doc_path, path)

with mkdocs_gen_files.open("python/reference/SUMMARY.md", "w") as nav_file:
    nav_file.writelines(nav.build_literate_nav())
12 changes: 12 additions & 0 deletions docs/index.md
@@ -6,3 +6,15 @@ A modern PySpark Connect/Classic wrapper on top of the Deequ, a beautiful Data Q
![](https://raw.githubusercontent.com/SemyonSinchenko/tsumugi-spark/main/static/tsumugi-spark-logo.png)

**_NOTE:_** _Tsumugi Shiraui is a chimera: a hybrid of Human and Gauna. She combines the chaotic power of Gauna with Human intelligence and empathy. Like the original character from the manga "Knights of Sidonia", this project aims to combine the very powerful but hard-to-learn-and-use Deequ Scala library with the usability and simplicity of Spark Connect (PySpark Connect, Spark Connect Go, Spark Connect Rust, etc.)._

## Table of contents

- Language-agnostic concepts and use cases for Deequ
    * [Concepts of Deequ](concepts/concepts.md)
    * [Possible use cases for Deequ](concepts/usecases.md)
- PySpark Connect / Classic API
    * [Main data structures and classes](python/main_structures.md)
    * [API Docs (auto-generated)](python/reference/SUMMARY.md)
- Example notebooks
    * [Basic example](notebooks/basic_example.ipynb)
    * [Using predicates with analyzers](notebooks/using_where.ipynb)