Security exceptions when working with Databricks Unity Catalog #2459
Comments
Incidentally enough, just yesterday I was talking to Databricks about this, and it's because Splink uses custom jars that aren't supported on shared mode/serverless clusters. They told me it is on the roadmap. The current way around it, other than redeveloping Splink to remove the custom jars, is to use a single-user cluster on Databricks.
@aamir-rj I'm afraid we don't have access to Databricks so can't really help out with these sorts of errors. There are various discussions of similar issues that may (but may not) help:

Thanks @fscholes! Incidentally, if you ever have a chance to mention it to Databricks, it'd be great if they could simply add a couple of (fairly simple) functions into Databricks itself so the custom jar was no longer needed. It causes people a lot of hassle due to these security issues. The full list of custom UDFs is here:

But probably jaro-winkler would be sufficient for the vast majority of users!
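For anyone wondering what the blocked operation actually looks like, here is a minimal sketch (not Splink's actual registration code; the UDF name and Java class name are placeholders) of how a jar-backed UDF is typically exposed to Spark SQL. It is this kind of JVM access that shared-mode/serverless Unity Catalog clusters appear to restrict, which is where the security exceptions come from.

```python
# Minimal sketch, not Splink's actual registration code: the SQL name and the
# Java class name below are placeholders. Exposing a jar-backed UDF requires
# JVM access that shared-mode/serverless Unity Catalog clusters restrict.
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# On a single-user cluster with the jar attached this succeeds; on a shared
# cluster a call like this is rejected with a py4j security exception.
spark.udf.registerJavaFunction(
    "jaro_winkler_sim",                    # name callable from Spark SQL
    "com.example.similarity.JaroWinkler",  # placeholder class inside the jar
    DoubleType(),
)

spark.sql("SELECT jaro_winkler_sim('martha', 'marhta') AS score").show()
```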
Are you aware Databricks has an ARC module, based on Splink?

Thanks. Yes. As far as I know, it doesn't get around this jar problem (but I would love to be corrected on that!)
I'm aware of ARC, but I don't think it's been actively developed for quite some time now |
Thanks for the replies. I spoke to Databricks and they asked us to run on a single-user cluster, which works fine. Thanks.
Another possible way around this for anyone using splink >= 3.9.10 is an option to opt out of registering the custom jars. In Splink 4 this looks like:

```python
...
spark_api = SparkAPI(spark_session=spark, register_udfs_automatically=False)
linker = Linker(df, settings, db_api=spark_api)
...
```

See issue #1744 and the option added in #1774.
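To flesh that out a little, below is a sketch (under assumptions, not taken from this thread) of how the opt-out might be combined with comparisons that rely only on built-in Spark SQL functions, since with `register_udfs_automatically=False` the jar-backed similarity functions presumably won't be available. The column names and comparison choices are illustrative, and `spark` and `df` are assumed to already exist.

```python
# Sketch assuming Splink 4: opt out of the custom jar UDFs and use only
# comparisons backed by built-in Spark SQL functions (exact match, levenshtein).
from splink import Linker, SettingsCreator, SparkAPI, block_on
import splink.comparison_library as cl

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),              # illustrative column names
        cl.LevenshteinAtThresholds("surname", 2),
    ],
    blocking_rules_to_generate_predictions=[block_on("dob")],
)

# No custom jars are registered, so this should not trip the security checks.
spark_api = SparkAPI(spark_session=spark, register_udfs_automatically=False)
linker = Linker(df, settings, db_api=spark_api)
df_predictions = linker.inference.predict()
```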
I used that option but always get this error: `name 'alt' is not defined`. Tried installing alt but still the same issue.
![Image](https://github.com/user-attachments/assets/d7f152da-9289-4592-b369-28b674816a02)
From the image it looks like you are mixing Splink 2 (`splink.Splink`) and Splink 4 (`splink.SparkAPI`) code. The option `register_udfs_automatically` is only available in Splink 4 and later versions of Splink 3 - there is no equivalent in Splink 2.

If you are able to upgrade then I would recommend moving to Splink 4, as then all of [the documentation](https://moj-analytical-services.github.io/splink/getting_started.html) will be applicable.

If you are not able to upgrade then you will need to stick to the single-user cluster workaround.
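If it helps to double-check, here is a quick, generic way (nothing Splink-specific) to confirm which Splink version is actually installed on the cluster:

```python
# Confirm the installed Splink version, since the APIs differ between
# Splink 2, 3 and 4 and mixing them produces errors like the one above.
from importlib.metadata import version

print(version("splink"))
```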
Nope, I used Splink 4 only.
What I mean is that the code you appear to be running is not valid Splink 4 code - there is no longer a `Splink` object. Valid Splink 4 code looks like:

```python
from splink import Linker, SparkAPI
...
spark_api = SparkAPI(spark_session=spark, register_udfs_automatically=False)
linker = Linker(df, settings, db_api=spark_api)
df_dedupe_result = linker.inference.predict()
```

As to your error, it appears that you do not have the requirement `altair` installed.
@aamir-rj do you have `altair` installed? It should install automatically with splink as it is a required dependency - but if not for some reason you can run `pip install altair`.
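As a quick sanity check (a sketch assuming a Databricks notebook environment), you can confirm whether `altair` is importable on the cluster before digging further:

```python
# Check that altair (the library usually aliased as "alt" in the error) is
# present. If the import fails, install it, e.g. with `%pip install altair`
# in its own notebook cell or as a cluster library.
import altair as alt

print(alt.__version__)
```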
@aamir-rj and anyone else who stumbles across this error, I've had the same problem with importing `altair`.
What happens?
When working on clusters that are in shared mode on Unity Catalog, Splink throws py4j security exceptions.
To Reproduce
OS: Databricks
Splink version: 2.1.14 (`pip install splink==2.1.14`)
Have you tried this on the latest master branch? Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?