
Support more DML statements #209

Open
zerodarkzone opened this issue Nov 27, 2024 · 10 comments
Labels: enhancement (New feature or request)

Comments

@zerodarkzone
Contributor

Hi,
Right now we could say that sqlframe only supports the INSERT statement.

I know that Spark by itself does not support these kinds of statements, but I think they would be really useful. I also don't think supporting them would be out of scope for a library like this: in the end, the idea of sqlframe is to run with a SQL database as the backend, and these are common operations in a database/data warehouse.

Libraries like delta-lake and Snowpark support these kinds of operations, so they could be a good starting point for defining a good API for this.
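
For context, delta-lake's Python API (delta-spark) already exposes these operations on a table handle. The snippet below is roughly what that looks like today; spark is an existing SparkSession, updates_df is a placeholder source DataFrame, and the exact signatures should be checked against the Delta Lake docs:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

target = DeltaTable.forName(spark, "employee")

# UPDATE rows in place
target.update(condition=F.col("id") == 1, set={"age": F.col("age") + 1})

# DELETE rows matching a predicate
target.delete(condition=F.col("age") > 100)

# MERGE a source DataFrame into the table
(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdate(set={"age": F.col("s.age")})
    .whenNotMatchedInsertAll()
    .execute()
)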

@eakmanrq
Owner

So the idea then would be to create a "SQLFrame Table" object for each engine that you could then call operations like update and merge on? It is a cool idea and would expand the functionality of SQLFrame. Is this something you are interested in contributing?

@zerodarkzone
Contributor Author

Yes,
Snowpark has a Table class which is simply an extension of the DataFrame class with those extra functions. Would you prefer to create a Table class that inherits from DataFrame or just add the functionality directly to the DataFrame classes?
I'm interested in contributing (though I would not be able to test it on Redshift).
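
To make the question concrete, here is a minimal sketch of the Snowpark-style option, i.e. a Table class that extends DataFrame with DML entry points. All names and signatures here are hypothetical, not SQLFrame's actual API:

class DataFrame:
    """Stand-in for a DataFrame backed by an arbitrary SQL query."""

    def __init__(self, sql: str):
        self.sql = sql


class Table(DataFrame):
    """Adds DML operations that only make sense on a concrete table."""

    def __init__(self, name: str):
        super().__init__(f"SELECT * FROM {name}")
        self.name = name

    def update(self, _set: dict, where=None):
        ...  # build and return a lazy UPDATE against self.name

    def delete(self, where=None):
        ...  # build and return a lazy DELETE against self.name

    def merge(self, source: DataFrame, condition, clauses):
        ...  # build and return a lazy MERGE against self.name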

@zerodarkzone
Contributor Author

Hi,
Just thinking a little more about it: I think it's better to create a separate class. A DataFrame can represent any SQL query, but a Table should represent an actual table inside the database.

@eakmanrq
Owner

Yeah, this is how I was thinking about it.

Want to take a pass at it, but just target DuckDB at first? The goal would be to get aligned on the interface of the class and how the DataFrame and Table APIs work together. I tend to prototype things on DuckDB because of its speed and convenience.

@zerodarkzone
Contributor Author

Hi,

I'll start working on it with DuckDB and Databricks (that's the platform I use for work), and I'll try to keep the API aligned with what you have right now.

For now, my idea is to create a Table class which inherits from DataFrame and adds the new functionality.
I'm also thinking of changing the session.read.table("database.table") function to return a Table instead of a DataFrame. Any function called on the Table class other than a DML operation will convert the Table back into a DataFrame, so the DML operations can only be used on an actual table.
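
A minimal, self-contained sketch of that fallback behaviour, just to illustrate the idea (all names are hypothetical, not the actual implementation):

class DataFrame:
    def __init__(self, sql: str):
        self.sql = sql

    def select(self, *cols) -> "DataFrame":
        return DataFrame(f"SELECT {', '.join(cols)} FROM ({self.sql})")


class Table(DataFrame):
    def __init__(self, name: str):
        super().__init__(f"SELECT * FROM {name}")
        self.name = name

    def _to_dataframe(self) -> DataFrame:
        # Non-DML calls drop down to a plain DataFrame.
        return DataFrame(self.sql)

    def select(self, *cols) -> DataFrame:
        return self._to_dataframe().select(*cols)

    def delete(self, where: str) -> str:
        # DML calls stay on the Table because they need a concrete table name.
        return f"DELETE FROM {self.name} WHERE {where}"


employees = Table("employee")
print(type(employees.select("id", "age")))  # <class '...DataFrame'>
print(employees.delete("age > 100"))        # DELETE FROM employee WHERE age > 100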

@eakmanrq
Owner

Ok, that sounds interesting. So the thinking is that although session.read.table now returns a Table, it will maintain API compatibility by converting to a DataFrame if the following method call is not part of the Table API. I'm also open to creating new API endpoints for this (like session.sqlframe.table), but I see how your approach would make it feel more natively integrated.

eakmanrq added the enhancement (New feature or request) label on Dec 5, 2024
@zerodarkzone
Contributor Author

Hi,

I'm working on this on my fork. Could you take a look?

Creating a completely separate API is also possible.

zerodarkzone#3

@zerodarkzone
Contributor Author

This is how it works:

When you use the function session.read.table, it returns a Table object.

df_employee = session.read.table("employee")
print(type(df_employee))
<class 'sqlframe.databricks.table.DatabricksTable'>

When you call any other function (except for alias), it gets converted to a DataFrame.

print(type(df_employee.select("*")))
<class 'sqlframe.databricks.dataframe.DatabricksDataFrame'>

The new functions work something like this:
update

# Lazy operation
update = df_employee.update(
    _set={"age": df_employee["age"] + 1},
    where=df_employee["id"] == 1,
)
# Execute the update
update.execute()

delete

# Lazy operation
delete = df_employee.delete(
    where=df_employee["age"] > 100,
)
# Execute the delete
delete.execute()

merge

new_df = df_employee.where(F.col("age") > 40)

# Lazy operation
merge = df_employee.merge(
    new_df,
    (df_employee["id"] == new_df["id"]) & F.lit(True),
    [
        WhenMatched(condition=(df_employee["id"] == new_df["id"])).update(
            set_={"age": df_employee["age"] + new_df["age"] + new_df["age"] + 1}
        ),
        WhenNotMatched().insert(
            values={
                "id": new_df["id"],
                "fname": new_df["fname"],
                "lname": F.col("lname"),
                "age": new_df["age"] + 1,
                "store_id": F.col("store_id"),
                "active": F.lit(True),
            }
        ),
        WhenNotMatchedBySource().delete(),
    ],
)
# Execute the merge
merge.execute()
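
In case it helps reviewers, the merge clauses used above can be thought of as small builder objects, roughly in the spirit of delta-lake/Snowpark. The dataclasses below are a simplified, illustrative sketch only, not the actual classes in the fork:

from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class WhenMatched:
    condition: Optional[Any] = None
    update_set: Dict[str, Any] = field(default_factory=dict)

    def update(self, set_: Dict[str, Any]) -> "WhenMatched":
        self.update_set = set_
        return self


@dataclass
class WhenNotMatched:
    condition: Optional[Any] = None
    insert_values: Dict[str, Any] = field(default_factory=dict)

    def insert(self, values: Dict[str, Any]) -> "WhenNotMatched":
        self.insert_values = values
        return self


@dataclass
class WhenNotMatchedBySource:
    condition: Optional[Any] = None
    delete_row: bool = False

    def delete(self) -> "WhenNotMatchedBySource":
        self.delete_row = True
        return self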

@eakmanrq
Owner

Very cool. I really like how it maintains compatibility with the PySpark API while also creating a clean interface for doing these table operations. Nice job! 👍

I looked over your fork and overall it looks great. Will certainly dig more into the details once you are ready to create a PR. One thing I did notice is that it currently doesn't have any tests. Is that something you plan on doing before you submit the PR?

@zerodarkzone
Contributor Author

Hi,

Yes, adding some tests is something I plan to do.
