Provide pip install for PySpark users #1079
Comments
The problem here is that a SparkSession has to be configured with Comet's JARs and plugin settings before the JVM is started. The same story applies to a Databricks (or Databricks-like) Python notebook, for example: a SparkSession already exists inside, so because of the already-running JVM a pip-installed package cannot inject the JARs after the fact. Another case is PySpark Connect, but in that case Comet JARs should be on the server side, and any pip-installable package cannot help there anyhow...

UPD. If it is fine, I can create a simple pip-installable package that provides an executable command.
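A minimal sketch of what such a command could look like, assuming it is packaged as a console-script entry point that wraps `spark-submit`; the entry-point name, the `COMET_JAR` environment variable, and the default JAR path are all assumptions for illustration, not an existing Comet interface:

```python
# Hypothetical sketch of a pip-installable wrapper around spark-submit.
# Nothing here is an existing Comet API: the COMET_JAR variable, the default
# path, and the entry-point name are illustrative assumptions.
import os
import subprocess
import sys


def main() -> None:
    # Where the package would have placed (or downloaded) the Comet JAR.
    comet_jar = os.environ.get("COMET_JAR", "/opt/comet/comet-spark.jar")
    cmd = [
        "spark-submit",
        "--jars", comet_jar,
        "--conf", "spark.plugins=org.apache.spark.CometPlugin",
        *sys.argv[1:],  # forward the user's own spark-submit arguments
    ]
    raise SystemExit(subprocess.call(cmd))


if __name__ == "__main__":
    main()
```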
Thanks for the context @SemyonSinchenko. @MrPowers fyi
@SemyonSinchenko and I will brainstorm this next week and report back on this issue. |
We had a discussion with @MrPowers and these are the options we found:

**The problem**

Comet needs a JAR that matches the user's Spark version, plus the plugin configuration, to be in place when the SparkSession is created; a plain `pip install` does not provide that by itself.

**Solution 1: fat python installation**

In that case all the Comet JARs are packed as resources of the python package, together with Spark's own JARs. Under the hood, for example, the package would depend on a matching `pyspark` and wire the bundled Comet JAR into the Spark configuration at import time (a sketch follows after Solution 4).

**Solution 2: thin python installation**

In that case the package ships only the python part, and the matching Comet JAR is resolved at runtime, e.g. by pointing `spark.jars.packages` at the right Maven coordinates for the installed Spark version.

**Solution 3: just a helper**

In that case the package does not manage JARs at all and only provides helper functions that return the correct Maven coordinates and Spark settings for the user's environment. For example, it may be done in the same way as it is done for python-deequ (a Comet adaptation is sketched after Solution 4):

```python
import os
import re
from functools import lru_cache

SPARK_TO_DEEQU_COORD_MAPPING = {
"3.5": "com.amazon.deequ:deequ:2.0.7-spark-3.5",
"3.3": "com.amazon.deequ:deequ:2.0.7-spark-3.3",
"3.2": "com.amazon.deequ:deequ:2.0.7-spark-3.2",
"3.1": "com.amazon.deequ:deequ:2.0.7-spark-3.1"
}
def _extract_major_minor_versions(full_version: str):
major_minor_pattern = re.compile(r"(\d+\.\d+)\.*")
match = re.match(major_minor_pattern, full_version)
if match:
return match.group(1)
@lru_cache(maxsize=None)
def _get_spark_version() -> str:
try:
spark_version = os.environ["SPARK_VERSION"]
except KeyError:
raise RuntimeError(f"SPARK_VERSION environment variable is required. Supported values are: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}")
return _extract_major_minor_versions(spark_version)
def _get_deequ_maven_config():
spark_version = _get_spark_version()
try:
return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
except KeyError:
raise RuntimeError(
f"Found incompatible Spark version {spark_version}; Use one of the Supported Spark versions for Deequ: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}"
) Solution 4: all together
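For Solution 1, a minimal sketch of how a fat package could expose the bundled JAR; the `comet` package name and the `jars/comet-spark.jar` resource path are assumptions for illustration:

```python
# Hypothetical Solution 1 sketch: the Comet JAR ships inside the wheel and is
# wired into a new SparkSession. Package and resource names are assumptions.
from importlib.resources import files

from pyspark.sql import SparkSession


def comet_session(app_name: str = "comet-app") -> SparkSession:
    # Resolve the JAR bundled as a package resource of the (hypothetical)
    # `comet` package; works for a regular wheel installed on the filesystem.
    jar_path = str(files("comet").joinpath("jars/comet-spark.jar"))
    return (
        SparkSession.builder.appName(app_name)
        .config("spark.jars", jar_path)
        .config("spark.plugins", "org.apache.spark.CometPlugin")
        .getOrCreate()
    )
```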
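Adapted to Comet, the Solution 3 helper might look like the following; the Maven coordinates in the mapping are illustrative placeholders, not verified release artifacts:

```python
# Hypothetical Solution 3 helper for Comet, mirroring the python-deequ pattern
# above. The coordinates below are illustrative placeholders, not verified
# releases; check Maven Central for the real artifacts.
import os
import re

SPARK_TO_COMET_COORD_MAPPING = {
    "3.5": "org.apache.datafusion:comet-spark-spark3.5_2.12:0.4.0",  # assumed
    "3.4": "org.apache.datafusion:comet-spark-spark3.4_2.12:0.4.0",  # assumed
}


def get_comet_maven_coord() -> str:
    # Same idea as deequ: read SPARK_VERSION and map major.minor to a coordinate.
    full_version = os.environ.get("SPARK_VERSION", "")
    match = re.match(r"(\d+\.\d+)", full_version)
    key = match.group(1) if match else None
    if key not in SPARK_TO_COMET_COORD_MAPPING:
        raise RuntimeError(
            f"Unsupported SPARK_VERSION {full_version!r}; "
            f"supported: {list(SPARK_TO_COMET_COORD_MAPPING)}"
        )
    return SPARK_TO_COMET_COORD_MAPPING[key]
```

The returned coordinate could then be passed via `spark.jars.packages`, together with `spark.plugins=org.apache.spark.CometPlugin`, when the session is built.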
**What is the problem the feature request solves?**

As a PySpark user, I would like to be able to install Comet using pip.

**Describe the potential solution**

No response

**Additional context**

No response