Provide pip install for PySpark users #1079

Open
andygrove opened this issue Nov 13, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@andygrove
Member

What is the problem the feature request solves?

As a PySpark user, I would like to be able to install Comet using pip.

Describe the potential solution

No response

Additional context

No response

@andygrove andygrove added the enhancement New feature or request label Nov 13, 2024
@SemyonSinchenko
Member

SemyonSinchenko commented Nov 13, 2024

The problem here is that the spark.plugins config is static, so all plugins have to be specified before the PySpark SparkSession is created. If users run a Comet job by passing a Python script to spark-submit, they have to specify the JAR locations and plugins manually, and the process is the same as submitting a JVM Spark job. In my experience it is very rare for users to create a SparkSession inside a Python script and run it as a plain Python file rather than via spark-submit.
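
To make the "static" part concrete, here is a minimal sketch of what configuring Comet at session-creation time looks like for that rare plain-Python-script case. The JAR path is a placeholder, and the config keys and the org.apache.spark.CometPlugin class name follow the Comet installation docs rather than anything decided in this issue:

from pyspark.sql import SparkSession

COMET_JAR = "/path/to/comet-spark.jar"  # placeholder path to the Comet JAR

spark = (
    SparkSession.builder
    # These have to be set before the session exists; setting them on a running session has no effect.
    .config("spark.jars", COMET_JAR)
    .config("spark.driver.extraClassPath", COMET_JAR)
    .config("spark.executor.extraClassPath", COMET_JAR)
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .getOrCreate()
)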

The same story applies to a Databricks (or Databricks-like) Python notebook: a SparkSession already exists there, so because of spark.plugins it is impossible to add Comet to the already running session. The only way to run Comet on Databricks is via its so-called init scripts, which are executed before the cluster is created; in that case users have to manually copy the Comet JAR into the Spark jars folder and configure the plugin.

Another case is Spark Connect for PySpark, but there the Comet JARs have to be on the server side, and a pip-installable package cannot help with that anyway...

UPD. If that works for everyone, I can create a simple pip-installable package that provides an executable command like spark-comet-submit (or something similar). Inside this command I can wrap a regular spark-submit call, add the Comet JARs to the classpath, and configure the plugin. The Comet JARs could either be shipped as part of the Python package, or the package could contain only the Maven coordinates of the Comet JARs.

@andygrove
Member Author

Thanks for the context @SemyonSinchenko.

@MrPowers fyi

@MrPowers

@SemyonSinchenko and I will brainstorm this next week and report back on this issue.

@SemyonSinchenko
Member

@MrPowers and I discussed this; these are the options we found:

The problem

  • All "plugin" configs in Apache Spark are static and cannot be changed once the job is running;
  • Comet does not currently provide any additional functionality for end users;

Solution 1: fat python installation

In this case all the Comet JARs are packed as resources of the Python package, together with the Spark JARs themselves. spark-submit, pyspark, spark-shell, etc. are wrapped so that jobs run on Comet by default.

For example, pip install pycomet[spark35] will install:

  • pyspark 3.5.x
  • Comet JARs and resources
  • wrapped scripts for spark-submit, pyspark, spark-shell

Under the hood, the wrapped spark-submit, for example, adds the Comet JARs to the classpath and enables the plugin by default; all other parameters are passed through to the real spark-submit, as sketched below.
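
A rough sketch of that Solution 1 wrapper (the pycomet.jars resource package, the JAR file name, and the entry-point layout are hypothetical, not an existing API):

import sys
import subprocess
from importlib import resources


def main() -> None:
    # Locate the Comet JAR bundled as package data inside the hypothetical pycomet.jars package.
    with resources.as_file(resources.files("pycomet.jars").joinpath("comet-spark.jar")) as jar:
        comet_defaults = [
            "--jars", str(jar),
            "--conf", f"spark.driver.extraClassPath={jar}",
            "--conf", f"spark.executor.extraClassPath={jar}",
            "--conf", "spark.plugins=org.apache.spark.CometPlugin",
            "--conf", "spark.comet.enabled=true",
        ]
        # All user-supplied arguments are forwarded untouched after the Comet defaults.
        raise SystemExit(subprocess.call(["spark-submit", *comet_defaults, *sys.argv[1:]]))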

Solution 2: thin python installation

In this case pycomet is a tiny package that contains only the Maven coordinates of the Comet JARs plus, for example, comet-submit and pycomet commands, which again are just wrappers on top of Spark's spark-submit. SPARK_HOME has to be specified. At runtime, comet-submit / pycomet determines the Spark version, adds the Comet JARs from Maven Central to the classpath, and runs the submit command with the plugin configured, roughly as sketched below.
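
A sketch of the thin wrapper under stated assumptions: the Maven coordinates are placeholders, and the Spark version is taken from a SPARK_VERSION environment variable the same way the deequ helper below does it:

import os
import sys
import subprocess

# Hypothetical mapping from Spark major.minor version to Comet Maven coordinates (versions elided).
SPARK_TO_COMET_COORD_MAPPING = {
    "3.5": "org.apache.datafusion:comet-spark-spark3.5_2.12:<comet-version>",
    "3.4": "org.apache.datafusion:comet-spark-spark3.4_2.12:<comet-version>",
}


def main() -> None:
    spark_home = os.environ["SPARK_HOME"]             # required, as described above
    spark_version = os.environ["SPARK_VERSION"][:3]   # major.minor
    cmd = [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--packages", SPARK_TO_COMET_COORD_MAPPING[spark_version],
        "--conf", "spark.plugins=org.apache.spark.CometPlugin",
        "--conf", "spark.comet.enabled=true",
        *sys.argv[1:],  # user arguments are forwarded untouched
    ]
    raise SystemExit(subprocess.call(cmd))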

Solution 3: just a helper

In this case pycomet contains only the Maven coordinates.

For example, it could be done the same way it is done in python-deequ:

import os
import re
from functools import lru_cache

# Maps a Spark major.minor version to the matching deequ Maven coordinates.
SPARK_TO_DEEQU_COORD_MAPPING = {
    "3.5": "com.amazon.deequ:deequ:2.0.7-spark-3.5",
    "3.3": "com.amazon.deequ:deequ:2.0.7-spark-3.3",
    "3.2": "com.amazon.deequ:deequ:2.0.7-spark-3.2",
    "3.1": "com.amazon.deequ:deequ:2.0.7-spark-3.1"
}


def _extract_major_minor_versions(full_version: str):
    major_minor_pattern = re.compile(r"(\d+\.\d+)\.*")
    match = re.match(major_minor_pattern, full_version)
    if match:
        return match.group(1)


@lru_cache(maxsize=None)
def _get_spark_version() -> str:
    try:
        spark_version = os.environ["SPARK_VERSION"]
    except KeyError:
        raise RuntimeError(f"SPARK_VERSION environment variable is required. Supported values are: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}")

    return _extract_major_minor_versions(spark_version)


def _get_deequ_maven_config():
    spark_version = _get_spark_version()
    try:
        return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
    except KeyError:
        raise RuntimeError(
            f"Found incompatible Spark version {spark_version}; Use one of the Supported Spark versions for Deequ: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}"
        )

Solution 4: all together

  • Running pip install pycomet will install only the helpers and wrappers (pycomet, comet-submit);
  • Running pip install pycomet[spark35] will install the helpers and wrappers, plus all the Comet resources and pyspark itself (see the packaging sketch below).
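
For illustration, the base/extras split could be expressed in packaging metadata roughly like this (package name, module paths, entry points, and version bounds are all hypothetical):

from setuptools import setup, find_packages

setup(
    name="pycomet",
    packages=find_packages(),
    # Base install: only the helper/wrapper entry points.
    entry_points={
        "console_scripts": [
            "pycomet=pycomet.cli:main",
            "comet-submit=pycomet.cli:main",
        ]
    },
    # Extras additionally pull in pyspark; the matching Comet JARs would ship as package data.
    extras_require={
        "spark35": ["pyspark>=3.5,<3.6"],
        "spark34": ["pyspark>=3.4,<3.5"],
    },
)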
