diff --git a/README.md b/README.md
index 921414c..cebc6ba 100644
--- a/README.md
+++ b/README.md
@@ -2,9 +2,9 @@
 
 **_UNDER ACTIVE DEVELOPMENT_**
 
-[![python-client](https://github.com/SemyonSinchenko/tsumugi-spark/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/SemyonSinchenko/tsumugi-spark/actions/workflows/ci.yml)
+[![python-client](https://github.com/mrpowers-io/tsumugi-spark/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/mrpowers-io/tsumugi-spark/actions/workflows/ci.yml)
 
-[Documentation](https://semyonsinchenko.github.io/tsumugi-spark/)
+[Documentation](https://mrpowers-io.github.io/tsumugi-spark/)
@@ -49,7 +49,7 @@
 From a high-level perspective, Tsumugi implements three main components:
 
 The diagram below provides an overview of this concept:
 
-![](static/diagram.png)
+![](https://raw.githubusercontent.com/mrpowers-io/tsumugi-spark/refs/heads/main/static/diagram.png)
 
 ## Project structure
@@ -97,7 +97,7 @@
 poetry env use python3.10 # any version bigger than 3.10 should work
 poetry install --with dev # that install tsumugi as well as jupyter notebooks and pyspark[connect]
 ```
 
-Now you can run jupyter and try the example notebook (`tsumugi-python/examples/basic_example.ipynb`): [Notebook](https://github.com/SemyonSinchenko/tsumugi-spark/blob/main/tsumugi-python/examples/basic_example.ipynb)
+Now you can run jupyter and try the example notebook (`docs/notebooks/basic_example.ipynb`): [Notebook](https://github.com/mrpowers-io/tsumugi-spark/blob/main/docs/notebooks/basic_example.ipynb)
 
 ### Server
diff --git a/docs/dev/contributing.md b/docs/dev/contributing.md
new file mode 100644
index 0000000..6593d87
--- /dev/null
+++ b/docs/dev/contributing.md
@@ -0,0 +1,115 @@
+# Contributing guide
+
+## Development environment
+
+This section provides a brief overview of the minimal developer environment. Because `tsumugi` is a complex multi-language project, this part is split into subsections covering the JVM server component, the protobuf and buf component, and the Python client component.
+
+### Makefile
+
+Many useful commands have been moved to the `Makefile`. To run `make` commands, one needs the make utility, which can be obtained from the [GNU Make website](https://www.gnu.org/software/make/#download). For some operating systems, `make` is also available through package managers.
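+For example, the targets referenced later in this guide are all run from the repository root (a quick sketch; see the `Makefile` itself for the full list of targets):
+
+```sh
+make generate_code     # regenerate the Python classes from the protobuf definitions
+make lint_python       # apply the Python formatting rules
+make run_spark_server  # build the plugin and start a local Spark Connect Server
+```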
+
+### JVM
+
+For managing Java, Scala, and Maven, it is strongly recommended to use tools like [SDKMan](https://sdkman.io/). With SDKMan, all the necessary components can be easily installed using just a few shell commands.
+
+- Java 11; [OpenJDK archive](https://jdk.java.net/archive/), [installation manual](https://openjdk.org/install/);
+- Apache Maven; [Installation manual](https://maven.apache.org/install.html);
+- Scala 2.12.X; [Installation manual](https://www.scala-lang.org/download/);
+
+Installation with `SDKMan`:
+
+```sh
+sdk install java 11.0.25-zulu
+sdk install maven 3.9.6
+sdk install scala 2.12.20
+```
+
+### Protobuf
+
+- [Protoc](https://github.com/protocolbuffers/protobuf); [Installation manual](https://grpc.io/docs/protoc-installation/);
+- [Buf](https://github.com/bufbuild/buf); [Installation manual](https://buf.build/docs/installation/);
+
+Java classes are generated from the protobuf messages during the build via a Maven plugin, so there is no reason to store them inside the repository. Python classes are generated from the protobuf messages manually. To generate or update the Python classes, run the following:
+
+```sh
+make generate_code
+```
+
+### Python
+
+It is strongly recommended to use tools like [pyenv](https://github.com/pyenv/pyenv/tree/master) to manage Python versions.
+
+- [Python](https://www.python.org/downloads/release/python-3100/); with `pyenv`: `pyenv install 3.10`;
+- [Poetry](https://python-poetry.org/); [Installation manual](https://python-poetry.org/docs/#installation);
+
+Run the following commands to create a Python venv:
+
+```sh
+poetry env use %path-to-python3.10% # On POSIX systems it should be like ~/.pyenv/versions/3.10.14/bin/python
+poetry install --with dev
+```
+
+## Code style and naming conventions
+
+### Linters
+
+Tsumugi uses [`scalafmt`](https://scalameta.org/scalafmt/) for the server part and [`ruff`](https://github.com/astral-sh/ruff) for the Python client.
+
+To apply formatting rules to the Scala part, run the following command:
+
+```sh
+mvn spotless:apply
+```
+
+To apply the Python formatting rules, run the following command:
+
+```sh
+make lint_python
+```
+
+### Python tips
+
+All the APIs built on top of protobuf classes are designed in the following way:
+
+The protobuf message:
+
+```proto3
+message KLLParameters {
+  int32 sketch_size = 1;
+  double shrinking_factor = 2;
+  int32 number_of_buckets = 3;
+}
+```
+
+The Python API:
+
+```python
+@dataclass
+class KLLParameters:
+    """Parameters for KLLSketch."""
+
+    sketch_size: int
+    shrinking_factor: float
+    number_of_buckets: int
+
+    def _to_proto(self) -> proto.KLLSketch.KLLParameters:
+        return proto.KLLSketch.KLLParameters(
+            sketch_size=self.sketch_size,
+            shrinking_factor=self.shrinking_factor,
+            number_of_buckets=self.number_of_buckets,
+        )
+```
+
+To maintain a consistent style across all Python code, the following rules should be followed (a short usage sketch is shown after the list):
+
+1. All classes that wrap code generated from protobuf messages should be implemented as dataclasses, unless there is a compelling reason not to do so.
+2. Each of these classes should have a private method `_to_proto(self) -> "protobuf class"` that converts the dataclass to the proto-serializable class.
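+
+For illustration, a wrapper defined this way is used like any other dataclass, and the protobuf object is only created at the serialization boundary. A minimal sketch of the `KLLParameters` class above (the parameter values are made up):
+
+```python
+# Hypothetical values, for illustration only
+params = KLLParameters(sketch_size=2048, shrinking_factor=0.64, number_of_buckets=10)
+msg = params._to_proto()  # a proto.KLLSketch.KLLParameters message, ready to be serialized
+```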
+
+## Running examples or testing clients
+
+To simplify testing and development, there is a script that builds the server plugin, downloads and unpacks the Spark distribution, and runs the Spark Connect Server with all the necessary configurations. To run it, use `make run_spark_server`. After that, the server will be available at `sc://localhost:15002`.
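+
+Once the server is up, any Spark Connect client can connect to it. Below is a minimal smoke test from PySpark, assuming a `pyspark` version with Spark Connect support is installed (it comes with `poetry install --with dev`); the counting query is just an illustration:
+
+```python
+from pyspark.sql import SparkSession
+
+# Connect to the local Spark Connect Server started by `make run_spark_server`
+spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
+
+# A trivial query to verify that the connection works
+print(spark.range(5).count())  # prints: 5
+```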
+
+## Examples and notebooks
+
+Examples and notebooks are part of the documentation and are placed in the `docs/notebooks` folder.
diff --git a/docs/index.md b/docs/index.md
index 2903f12..383a153 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -3,7 +3,7 @@
 A modern PySpark Connect/Classic wrapper on top of the Deequ, a beautiful Data Quality library from AWS Labs.
 
-![](https://raw.githubusercontent.com/SemyonSinchenko/tsumugi-spark/main/static/tsumugi-spark-logo.png)
+![](https://raw.githubusercontent.com/mrpowers-io/tsumugi-spark/main/static/tsumugi-spark-logo.png)
 
 **_NOTE:_** _Tsumugi Shiraui is a chimera: a hybrid of Human and Gauna. She combines the chaotic power of Gauna with a Human intillegence and empathia. Like an original character of the Manga "Knights of Sidonia", this project aims to make a hybrid of very powerful but hard to learn and use Deequ Scala Library with a usability and simplicity of Spark Connect (PySpark Connect, Spark Connect Go, Spark Connect Rust, etc.)._
diff --git a/mkdocs.yml b/mkdocs.yml
index 5756898..c34bfc7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,7 +1,7 @@
 site_name: Tsumugi
-site_url: "https://semyonsinchenko.github.io/tsumugi-spark/"
-repo_url: "https://github.com/semyonsinchenko/tsumugi-spark"
-repo_name: "semyonsinchenko/tsumugi-spark"
+site_url: "https://mrpowers-io.github.io/tsumugi-spark/"
+repo_url: "https://github.com/mrpowers-io/tsumugi-spark"
+repo_name: "mrpowers-io/tsumugi-spark"
 
 theme:
   name: material
@@ -56,6 +56,8 @@ nav:
   - Example Notebooks:
     - 'Basic usage': "notebooks/basic_example.ipynb"
     - 'Using predicates': "notebooks/using_where.ipynb"
+  - Contributing:
+    - 'Dev environment & code style': "dev/contributing.md"
 
 markdown_extensions:
   - markdown_include.include: