Minimal contributing guide #42

Merged
merged 2 commits into from
Oct 31, 2024
6 changes: 3 additions & 3 deletions README.md
@@ -2,9 +2,9 @@

**_UNDER ACTIVE DEVELOPMENT_**

-[![python-client](https://github.com/SemyonSinchenko/tsumugi-spark/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/SemyonSinchenko/tsumugi-spark/actions/workflows/ci.yml)
+[![python-client](https://github.com/mrpowers-io/tsumugi-spark/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/mrpowers-io/tsumugi-spark/actions/workflows/ci.yml)

-[Documentation](https://semyonsinchenko.github.io/tsumugi-spark/)
+[Documentation](https://mrpowers-io.github.io/tsumugi-spark/)

<p align="center">
<img src="https://raw.githubusercontent.com/SemyonSinchenko/tsumugi-spark/main/static/tsumugi-spark-logo.png" alt="tsumugi-shiraui" width="600" align="middle"/>
@@ -49,7 +49,7 @@ From a high-level perspective, Tsumugi implements three main components:

The diagram below provides an overview of this concept:

-![](static/diagram.png)
+![](https://raw.githubusercontent.com/mrpowers-io/tsumugi-spark/refs/heads/main/static/diagram.png)

## Project structure

115 changes: 115 additions & 0 deletions docs/dev/contributing.md
@@ -0,0 +1,115 @@
# Contributing guide

## Development environment

This section provides a brief overview of the minimal developer environment. Because `tsumugi` is a complex multi-language project, this part is split into subsections covering the JVM server component, the protobuf and buf component, and the Python client component.

### Makefile

Some frequently used commands are collected in the `Makefile`. Running them requires the `make` utility, which can be downloaded from the [GNU Make website](https://www.gnu.org/software/make/#download). On some operating systems, `make` is also available through package managers.

### JVM

For managing Java, Scala, and Maven, it is strongly recommended to use tools like [SDKMan](https://sdkman.io/). With SDKMan, all the necessary components can be easily installed using just a few shell commands.

- Java 11; [OpenJDK archive](https://jdk.java.net/archive/), [installation manual](https://openjdk.org/install/);
- Apache Maven; [Installation manual](https://maven.apache.org/install.html);
- Scala 2.12.X; [Installation manual](https://www.scala-lang.org/download/);

Installation with `SDKMan`:

```sh
sdk install java 11.0.25-zulu
sdk install maven 3.9.6
sdk install scala 2.12.20
```


### Protobuf

- [Protoc](https://github.com/protocolbuffers/protobuf); [Installation manual](https://grpc.io/docs/protoc-installation/);
- [Buf](https://github.com/bufbuild/buf); [Installation manual](https://buf.build/docs/installation/);

Java classes are generated from the protobuf messages during the build by a Maven plugin, so there is no reason to store them in the repository. Python classes are generated from the protobuf messages manually. To generate or update the Python classes, run the following:

```sh
make generate_code
```

### Python

It is strongly recommended to use tools like [pyenv](https://github.com/pyenv/pyenv/tree/master) to manage Python versions.

- [Python](https://www.python.org/downloads/release/python-3100/); with `pyenv`: `pyenv install 3.10`;
- [Poetry](https://python-poetry.org/); [Installation manual](https://python-poetry.org/docs/#installation);

Run the following commands to create a Python virtual environment:

```sh
poetry env use %path-to-python3.10% # On POSIX systems it should be like ~/.pyenv/versions/3.10.14/bin/python
poetry install --with dev
```

## Code style and naming conventions

### Linters

Tsumugi uses [`scalafmt`](https://scalameta.org/scalafmt/) for the server part and [`ruff`](https://github.com/astral-sh/ruff) for the Python client.

To apply formatting rules to the Scala part, run the following command:

```sh
mvn spotless:apply
```

To apply Python formatting rules, run the following command:

```sh
make lint_python
```

### Python tips

All the APIs built on top of protobuf classes follow the same pattern:

The protobuf message:

```proto3
message KLLParameters {
  int32 sketch_size = 1;
  double shrinking_factor = 2;
  int32 number_of_buckets = 3;
}
```

The Python API:

```python
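from dataclasses import dataclass

# `proto` refers to the Python module generated from the protobuf messages.


@dataclass
class KLLParameters:
    """Parameters for KLLSketch."""

    sketch_size: int
    shrinking_factor: float
    number_of_buckets: int

    def _to_proto(self) -> proto.KLLSketch.KLLParameters:
        # Copy each dataclass field into the generated protobuf message.
        return proto.KLLSketch.KLLParameters(
            sketch_size=self.sketch_size,
            shrinking_factor=self.shrinking_factor,
            number_of_buckets=self.number_of_buckets,
        )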
```

To maintain a consistent style across all Python code, the following rules should be followed:

1. All classes that wrap code generated from protobuf messages should be implemented as dataclasses, unless there is a compelling reason not to do so.
2. Each of these classes should have a private method `_to_proto(self) -> "protobuf class"` that converts the dataclass to the proto-serializable class.
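
As a minimal sketch of the two rules above, the following self-contained example runs without the generated code. It uses a hypothetical stand-in class, `FakeKLLParametersProto`, in place of the generated protobuf class; real wrappers return classes from the generated module instead. The stand-in and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FakeKLLParametersProto:
    """Hypothetical stand-in for the generated protobuf message (illustration only)."""

    sketch_size: int = 0
    shrinking_factor: float = 0.0
    number_of_buckets: int = 0


@dataclass
class KLLParameters:
    """User-facing wrapper: a plain dataclass (rule 1) with a private converter (rule 2)."""

    sketch_size: int
    shrinking_factor: float
    number_of_buckets: int

    def _to_proto(self) -> FakeKLLParametersProto:
        # Copy every field one-to-one into the proto-serializable class.
        return FakeKLLParametersProto(
            sketch_size=self.sketch_size,
            shrinking_factor=self.shrinking_factor,
            number_of_buckets=self.number_of_buckets,
        )


# Sample values are arbitrary and chosen only to exercise the conversion.
params = KLLParameters(sketch_size=2048, shrinking_factor=0.64, number_of_buckets=10)
message = params._to_proto()
```

Keeping `_to_proto` private means users only interact with the dataclass fields, while the conversion to the wire format stays an internal detail of the client.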

## Running examples or testing clients
SemyonSinchenko marked this conversation as resolved.

To simplify testing and development, there is a script that builds a server plugin, downloads and unpacks the Spark distribution, and runs the Spark Connect Server with all the necessary configurations. To run it, use `make run_spark_server`. After that, the server will be available at `sc://localhost:15002`.
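
Before running examples against the local server, it can help to check that something is actually listening on the port mentioned above. The sketch below is not part of the project; the helper name `spark_connect_is_up` and its default timeout are assumptions. It only tests TCP reachability and does not verify that the listener is a Spark Connect server.

```python
import socket


def spark_connect_is_up(
    host: str = "localhost", port: int = 15002, timeout: float = 2.0
) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If the check passes and a PySpark version with Spark Connect support is installed, a client session can be created with `SparkSession.builder.remote("sc://localhost:15002").getOrCreate()`.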
Contributor

`make run_spark_server` results in a build error: `[ERROR] /workspaces/tsumugi-spark/tsumugi-server/src/test/scala/com/ssinchenko/tsumugi/DeequUtilsTest.scala:66: type mismatch; found : scala.util.Try[com.amazon.deequ.VerificationRunBuilder] required: com.amazon.deequ.VerificationRunBuilder [ERROR] one error found`

Changing the statement `val deequSuite = DeequSuiteBuilder.protoToVerificationSuite(data, protoSuiteBuilder.build())` to `val deequSuite = DeequSuiteBuilder.protoToVerificationSuite(data, protoSuiteBuilder.build()).getOrElse(throw new RuntimeException("Failed to create VerificationSuite"))` in `/workspaces/tsumugi-spark/tsumugi-server/src/test/scala/com/ssinchenko/tsumugi/DeequUtilsTest.scala` and adding relevant imports resolved it; however, my Scala is no good. Please have a look and use your judgement!

Collaborator

Good catch. I just realized that I somehow missed this in the previous PR and we actually do not have automated testing for the Scala server. I will create a separate issue to tackle this. Since it's not really related to this PR, I will merge it in.

Collaborator Author

I will try to fix the problem in this PR but we also need to run some minimal python client tests in CI

Collaborator

@zeotuan zeotuan Oct 31, 2024

@SemyonSinchenko Sorry, I did not get updated on this. I figured we already have #40, which might be a good candidate for this?


## Examples and notebooks

Examples and notebooks are part of the documentation and are placed in the `docs/notebooks` folder.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -3,7 +3,7 @@
A modern PySpark Connect/Classic wrapper on top of Deequ, a beautiful Data Quality library from AWS Labs.


-![](https://raw.githubusercontent.com/SemyonSinchenko/tsumugi-spark/main/static/tsumugi-spark-logo.png)
+![](https://raw.githubusercontent.com/mrpowers-io/tsumugi-spark/main/static/tsumugi-spark-logo.png)

**_NOTE:_** _Tsumugi Shiraui is a chimera: a hybrid of Human and Gauna. She combines the chaotic power of Gauna with human intelligence and empathy. Like the original character from the manga "Knights of Sidonia", this project aims to combine the very powerful but hard to learn and use Deequ Scala library with the usability and simplicity of Spark Connect (PySpark Connect, Spark Connect Go, Spark Connect Rust, etc.)._

Collaborator

Not related to this PR, but on line 12 there is a typo, `Lungauge`. Can we include the fix in this PR?

8 changes: 5 additions & 3 deletions mkdocs.yml
@@ -1,7 +1,7 @@
 site_name: Tsumugi
-site_url: "https://semyonsinchenko.github.io/tsumugi-spark/"
-repo_url: "https://github.com/semyonsinchenko/tsumugi-spark"
-repo_name: "semyonsinchenko/tsumugi-spark"
+site_url: "https://mrpowers-io.github.io/tsumugi-spark/"
+repo_url: "https://github.com/mrpowers-io/tsumugi-spark"
+repo_name: "mrpowers-io/tsumugi-spark"

 theme:
   name: material
@@ -56,6 +56,8 @@ nav:
   - Example Notebooks:
     - 'Basic usage': "notebooks/basic_example.ipynb"
     - 'Using predicates': "notebooks/using_where.ipynb"
+  - Contributing:
+    - 'Dev environment & code style': "dev/contributing.md"

 markdown_extensions:
   - markdown_include.include: