Error: 'JavaPackage' object is not callable #242

Open
rish-shar opened this issue May 27, 2024 · 4 comments

Comments

@rish-shar

Description

I have two PySpark dataframes, source_df and target_df. I ran pip install pyspark-extension to install diff.

Spark Version - 3.4.1
Scala Version - 2.12

When I run source_df.diff(target_df), I get the following error:

TypeError                                 Traceback (most recent call last)
File <command-2426417243632400>, line 1
----> 1 source_df.diff(target_df, )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:427, in diff(self, other, *id_columns)
    367 def diff(self: DataFrame, other: DataFrame, *id_columns: str) -> DataFrame:
    368     """
    369     Returns a new DataFrame that contains the differences between this and the other DataFrame.
    370     Both DataFrames must contain the same set of column names and data types.
   (...)
    425     :rtype DataFrame
    426     """
--> 427     return Differ().diff(self, other, *id_columns)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:337, in Differ.diff(self, left, right, *id_columns)
    274 """
    275 Returns a new DataFrame that contains the differences between the two DataFrames.
    276 
   (...)
    334 :rtype DataFrame
    335 """
    336 jvm = left._sc._jvm
--> 337 jdiffer = self._to_java(jvm)
    338 jdf = jdiffer.diff(left._jdf, right._jdf, _to_seq(jvm, list(id_columns)))
    339 return DataFrame(jdf, left.session_or_ctx())

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:270, in Differ._to_java(self, jvm)
    269 def _to_java(self, jvm: JVMView) -> JavaObject:
--> 270     jdo = self._options._to_java(jvm)
    271     return jvm.uk.co.gresearch.spark.diff.Differ(jdo)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:245, in DiffOptions._to_java(self, jvm)
    235 def _to_java(self, jvm: JVMView) -> JavaObject:
    236     return jvm.uk.co.gresearch.spark.diff.DiffOptions(
    237         self.diff_column,
    238         self.left_column_prefix,
    239         self.right_column_prefix,
    240         self.insert_diff_value,
    241         self.change_diff_value,
    242         self.delete_diff_value,
    243         self.nochange_diff_value,
    244         jvm.scala.Option.apply(self.change_column),
--> 245         self.diff_mode._to_java(jvm),
    246         self.sparse_mode,
    247         self.default_comparator._to_java(jvm),
    248         self._to_java_map(jvm, self.data_type_comparators, key_to_java=self._to_java_data_type),
    249         self._to_java_map(jvm, self.column_name_comparators)
    250     )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/gresearch/spark/diff/__init__.py:37, in DiffMode._to_java(self, jvm)
     36 def _to_java(self, jvm: JVMView) -> JavaObject:
---> 37     return jvm.uk.co.gresearch.spark.diff.DiffMode.withNameOption(self.name).get()

TypeError: 'JavaPackage' object is not callable

Any help would be appreciated.

@liteart

liteart commented Jun 13, 2024

The Python pip package contains only the stubs for code completion. Spark also requires the Java / Scala package to be installed (the Python package is not necessary on Databricks).

Add a Maven library and pass uk.co.gresearch.spark:spark-extension_2.13:2.12.0-3.5 as the Maven package, and the extension will load as expected.

@rish-shar
Author

The Python pip package contains only the stubs for code completion. Spark also requires the Java / Scala package to be installed (the Python package is not necessary on Databricks).

Add a Maven library and pass uk.co.gresearch.spark:spark-extension_2.13:2.12.0-3.5 as the Maven package, and the extension will load as expected.

@liteart How do I achieve this on Databricks? Do I need to add the package at cluster level then?

@EnricoMi changed the title from "diff on Databricks gives error: 'JavaPackage' object is not callable" to "Error: 'JavaPackage' object is not callable" on Jun 14, 2024
@EnricoMi pinned this issue on Jun 14, 2024
@EnricoMi
Contributor

Add a Maven Library and pass uk.co.gresearch.spark:spark-extension_2.13:2.12.0-3.5 as maven package, ...

In your setup (Scala 2.12, Spark 3.4.1), this should be uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.4.
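
The two coordinates quoted in this thread suggest the pattern uk.co.gresearch.spark:spark-extension_<scala>:<extension>-<spark major.minor>. The sketch below builds a coordinate from that inferred pattern. The helper names spark_extension_coordinate and build_session are hypothetical and not part of the library. Passing the coordinate through spark.jars.packages is one way to load it outside Databricks, but it only takes effect when set before the JVM starts. On an already running cluster, the package has to be attached as a Maven library, as described above.

```python
def spark_extension_coordinate(scala_version: str,
                               extension_version: str,
                               spark_version: str) -> str:
    """Build a Maven coordinate for spark-extension.

    Assumption: the pattern is inferred from the coordinates quoted in
    this thread, not taken from an official specification.
    """
    # Only the major.minor parts appear in the coordinate (e.g. 3.4.1 -> 3.4).
    scala_compat = ".".join(scala_version.split(".")[:2])
    spark_compat = ".".join(spark_version.split(".")[:2])
    return (
        "uk.co.gresearch.spark:"
        f"spark-extension_{scala_compat}:{extension_version}-{spark_compat}"
    )


def build_session(coordinate: str):
    """Create a SparkSession that resolves the package from Maven at startup."""
    # Imported lazily so the coordinate helper works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )


# Setup from this issue: Scala 2.12, Spark 3.4.1, extension version 2.12.0.
coordinate = spark_extension_coordinate("2.12", "2.12.0", "3.4.1")
print(coordinate)  # uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.4
```

With a session created by build_session(coordinate), source_df.diff(target_df) should find the Scala classes it needs, provided the Scala and Spark versions in the coordinate match the cluster.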

github-merge-queue bot pushed a commit that referenced this issue Aug 16, 2024
Provides a meaningful error message when user accesses a spark extension
function in Python that requires the Java / Scala package:

RuntimeError: Java / Scala package not found! You need to add the Maven
spark-extension package to your PySpark environment:
https://github.com/G-Research/spark-extension#python

Before, the error was:

    TypeError: 'JavaPackage' object is not callable

Improves #242
Supersedes #244
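
A check of this kind can be sketched as follows. This is a minimal illustration, not the code from the referenced commit. It assumes that py4j returns a placeholder object of type JavaPackage when a dotted name does not resolve to a class on the JVM classpath, which is why calling it fails with the TypeError shown in the traceback. The helper name require_java_class is hypothetical, and the JavaPackage class defined here is only a stand-in so the example runs without a JVM.

```python
MISSING_PACKAGE_MESSAGE = (
    "Java / Scala package not found! You need to add the Maven "
    "spark-extension package to your PySpark environment: "
    "https://github.com/G-Research/spark-extension#python"
)


def require_java_class(obj):
    """Return obj unchanged, or raise a readable error if it is a placeholder.

    Assumption: the type is matched by name so this sketch does not need
    py4j installed; real code could use isinstance with
    py4j.java_gateway.JavaPackage instead.
    """
    if type(obj).__name__ == "JavaPackage":
        raise RuntimeError(MISSING_PACKAGE_MESSAGE)
    return obj


class JavaPackage:
    """Stand-in for py4j.java_gateway.JavaPackage, for demonstration only."""


try:
    # In PySpark this would be e.g. jvm.uk.co.gresearch.spark.diff.DiffMode.
    require_java_class(JavaPackage())
except RuntimeError as error:
    print(error)
```

Running the check before the first JVM call replaces the opaque TypeError with a message that points at the missing Maven package.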