Provide pip install for PySpark users #1079

Open
andygrove opened this issue Nov 13, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@andygrove
Member

What is the problem the feature request solves?

As a PySpark user, I would like to be able to install Comet using pip.

Describe the potential solution

No response

Additional context

No response

@andygrove andygrove added the enhancement New feature or request label Nov 13, 2024
@SemyonSinchenko
Member

SemyonSinchenko commented Nov 13, 2024

The problem here is that the spark.plugins config is static, so all plugins have to be specified before the PySpark SparkSession is created. If users run a Comet job by passing a Python script to spark-submit, they have to specify the JAR locations and plugins manually, and the process is the same as submitting a JVM Spark job. In my experience it is very rare for users to create a SparkSession inside a Python script and run it as a plain Python file rather than via spark-submit.
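
To make the "static" part concrete, here is a minimal sketch of what configuring Comet at session-creation time looks like for that rare plain-Python-script case. The JAR path is a placeholder, and the config keys and the org.apache.spark.CometPlugin class name follow the Comet installation docs rather than anything decided in this issue:

from pyspark.sql import SparkSession

COMET_JAR = "/path/to/comet-spark.jar"  # placeholder path to the Comet JAR

spark = (
    SparkSession.builder
    # These have to be set before the session exists; setting them on a running session has no effect.
    .config("spark.jars", COMET_JAR)
    .config("spark.driver.extraClassPath", COMET_JAR)
    .config("spark.executor.extraClassPath", COMET_JAR)
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .getOrCreate()
)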

The same story applies to a Databricks (or Databricks-like) Python notebook: a SparkSession already exists there, so because of spark.plugins it is impossible to add Comet to the already running session. The only way to run Comet on Databricks is via its so-called init scripts, which are executed before the cluster is created; in that case users have to manually copy the Comet JAR into the Spark jars folder and configure the plugin.

Another case is Spark Connect for PySpark, but there the Comet JARs have to be on the server side, and a pip-installable package cannot help with that anyway...

UPD. If that works for everyone, I can create a simple pip-installable package that provides an executable command like spark-comet-submit (or something similar). Inside this command I can wrap a regular spark-submit call, add the Comet JARs to the classpath, and configure the plugin. The Comet JARs could either be shipped as part of the Python package, or the package could contain only the Maven coordinates of the Comet JARs.

@andygrove
Member Author

Thanks for the context @SemyonSinchenko.

@MrPowers fyi

@MrPowers

@SemyonSinchenko and I will brainstorm this next week and report back on this issue.

@SemyonSinchenko
Member

@MrPowers and I discussed this; these are the options we found:

The problem

  • All "plugin" configs in Apache Spark are static and cannot be changed once the job is running;
  • Comet does not currently provide any additional functionality for end users;

Solution 1: fat python installation

In this case all the Comet JARs are packed as resources of the Python package, together with the Spark JARs themselves. spark-submit, pyspark, spark-shell, etc. are wrapped so that jobs run on Comet by default.

For example, pip install pycomet[spark35] will install:

  • pyspark 3.5.x
  • Comet JARs and resources
  • wrapped scripts for spark-submit, pyspark, spark-shell

Under the hood, the wrapped spark-submit, for example, adds the Comet JARs to the classpath and enables the plugin by default; all other parameters are passed through to the real spark-submit, as sketched below.
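
A rough sketch of that Solution 1 wrapper (the pycomet.jars resource package, the JAR file name, and the entry-point layout are hypothetical, not an existing API):

import sys
import subprocess
from importlib import resources


def main() -> None:
    # Locate the Comet JAR bundled as package data inside the hypothetical pycomet.jars package.
    with resources.as_file(resources.files("pycomet.jars").joinpath("comet-spark.jar")) as jar:
        comet_defaults = [
            "--jars", str(jar),
            "--conf", f"spark.driver.extraClassPath={jar}",
            "--conf", f"spark.executor.extraClassPath={jar}",
            "--conf", "spark.plugins=org.apache.spark.CometPlugin",
            "--conf", "spark.comet.enabled=true",
        ]
        # All user-supplied arguments are forwarded untouched after the Comet defaults.
        raise SystemExit(subprocess.call(["spark-submit", *comet_defaults, *sys.argv[1:]]))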

Solution 2: thin python installation

In this case pycomet is a tiny package that contains only the Maven coordinates of the Comet JARs plus, for example, comet-submit and pycomet commands, which again are just wrappers on top of Spark's spark-submit. SPARK_HOME has to be specified. At runtime, comet-submit / pycomet determines the Spark version, adds the Comet JARs from Maven Central to the classpath, and runs the submit command with the plugin configured, roughly as sketched below.
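
A sketch of the thin wrapper under stated assumptions: the Maven coordinates are placeholders, and the Spark version is taken from a SPARK_VERSION environment variable the same way the deequ helper below does it:

import os
import sys
import subprocess

# Hypothetical mapping from Spark major.minor version to Comet Maven coordinates (versions elided).
SPARK_TO_COMET_COORD_MAPPING = {
    "3.5": "org.apache.datafusion:comet-spark-spark3.5_2.12:<comet-version>",
    "3.4": "org.apache.datafusion:comet-spark-spark3.4_2.12:<comet-version>",
}


def main() -> None:
    spark_home = os.environ["SPARK_HOME"]             # required, as described above
    spark_version = os.environ["SPARK_VERSION"][:3]   # major.minor
    cmd = [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--packages", SPARK_TO_COMET_COORD_MAPPING[spark_version],
        "--conf", "spark.plugins=org.apache.spark.CometPlugin",
        "--conf", "spark.comet.enabled=true",
        *sys.argv[1:],  # user arguments are forwarded untouched
    ]
    raise SystemExit(subprocess.call(cmd))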

Solution 3: just a helper

In this case pycomet contains only the Maven coordinates.

For example, it could be done the same way it is done in python-deequ:

import os
import re
from functools import lru_cache

# Maps a Spark major.minor version to the matching deequ Maven coordinates.
SPARK_TO_DEEQU_COORD_MAPPING = {
    "3.5": "com.amazon.deequ:deequ:2.0.7-spark-3.5",
    "3.3": "com.amazon.deequ:deequ:2.0.7-spark-3.3",
    "3.2": "com.amazon.deequ:deequ:2.0.7-spark-3.2",
    "3.1": "com.amazon.deequ:deequ:2.0.7-spark-3.1"
}


def _extract_major_minor_versions(full_version: str):
    major_minor_pattern = re.compile(r"(\d+\.\d+)\.*")
    match = re.match(major_minor_pattern, full_version)
    if match:
        return match.group(1)


@lru_cache(maxsize=None)
def _get_spark_version() -> str:
    try:
        spark_version = os.environ["SPARK_VERSION"]
    except KeyError:
        raise RuntimeError(f"SPARK_VERSION environment variable is required. Supported values are: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}")

    return _extract_major_minor_versions(spark_version)


def _get_deequ_maven_config():
    spark_version = _get_spark_version()
    try:
        return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
    except KeyError:
        raise RuntimeError(
            f"Found incompatible Spark version {spark_version}; Use one of the Supported Spark versions for Deequ: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}"
        )

Solution 4: all together

  • Running pip install pycomet will install only the helpers and wrappers (pycomet, comet-submit);
  • Running pip install pycomet[spark35] will install the helpers and wrappers, plus all the Comet resources and pyspark itself (see the packaging sketch below).
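
For illustration, the base/extras split could be expressed in packaging metadata roughly like this (package name, module paths, entry points, and version bounds are all hypothetical):

from setuptools import setup, find_packages

setup(
    name="pycomet",
    packages=find_packages(),
    # Base install: only the helper/wrapper entry points.
    entry_points={
        "console_scripts": [
            "pycomet=pycomet.cli:main",
            "comet-submit=pycomet.cli:main",
        ]
    },
    # Extras additionally pull in pyspark; the matching Comet JARs would ship as package data.
    extras_require={
        "spark35": ["pyspark>=3.5,<3.6"],
        "spark34": ["pyspark>=3.4,<3.5"],
    },
)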
