
Support more DML statements #209

Open
zerodarkzone opened this issue Nov 27, 2024 · 10 comments
Labels: enhancement (New feature or request)

Comments

@zerodarkzone
Contributor

Hi,
Right now we could say that sqlframe only supports the INSERT statement.

I know that Spark by itself does not support these kinds of statements, but I think they would be really useful. I also don't think supporting them would be out of scope for a library like this: in the end, the idea of sqlframe is to run with a SQL database as the backend, and these are common operations in a database/data warehouse.

Libraries like delta-lake and Snowpark support these kinds of operations, so they could be a good starting point for defining a good API for this.
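
For context, delta-lake's Python API (delta-spark) already exposes these operations on a table handle. The snippet below is roughly what that looks like today; spark is an existing SparkSession, updates_df is a placeholder source DataFrame, and the exact signatures should be checked against the Delta Lake docs:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

target = DeltaTable.forName(spark, "employee")

# UPDATE rows in place
target.update(condition=F.col("id") == 1, set={"age": F.col("age") + 1})

# DELETE rows matching a predicate
target.delete(condition=F.col("age") > 100)

# MERGE a source DataFrame into the table
(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdate(set={"age": F.col("s.age")})
    .whenNotMatchedInsertAll()
    .execute()
)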

@eakmanrq
Owner

So the idea then would be to create a "SQLFrame Table" object for each engine that you could then call operations like update and merge on? It is a cool idea and would expand the functionality of SQLFrame. Is this something you are interested in contributing?

@zerodarkzone
Contributor Author

Yes,
Snowpark has a Table class which is simply an extension of the DataFrame class with those extra functions. Would you prefer to create a Table class that inherits from DataFrame or just add the functionality directly to the DataFrame classes?
I'm interested in contributing (though I would not be able to test it on Redshift).
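
To make the question concrete, here is a minimal sketch of the Snowpark-style option, i.e. a Table class that extends DataFrame with DML entry points. All names and signatures here are hypothetical, not SQLFrame's actual API:

class DataFrame:
    """Stand-in for a DataFrame backed by an arbitrary SQL query."""

    def __init__(self, sql: str):
        self.sql = sql


class Table(DataFrame):
    """Adds DML operations that only make sense on a concrete table."""

    def __init__(self, name: str):
        super().__init__(f"SELECT * FROM {name}")
        self.name = name

    def update(self, _set: dict, where=None):
        ...  # build and return a lazy UPDATE against self.name

    def delete(self, where=None):
        ...  # build and return a lazy DELETE against self.name

    def merge(self, source: DataFrame, condition, clauses):
        ...  # build and return a lazy MERGE against self.name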

@zerodarkzone
Contributor Author

Hi,
Just thinking a little more about it: I think it's better to create a separate class. A DataFrame can represent any SQL query, but a Table should represent an actual table inside the database.

@eakmanrq
Owner

Yeah, this is how I was thinking about it.

Want to take a pass at it, but just target DuckDB at first? The goal would be to get aligned on the interface of the class and how the DataFrame and Table APIs work together. I tend to prototype things on DuckDB because of its speed and convenience.

@zerodarkzone
Contributor Author

Hi,

I'll start working on it with DuckDB and Databricks (that's the platform I use for work), and I'll try to keep the API aligned with what you have right now.

For now, my idea is to create a Table class which inherits from DataFrame and adds the new functionality.
I'm also thinking of changing the session.read.table("database.table") function to return a Table instead of a DataFrame. Any function called on the Table class other than a DML operation will convert the Table back into a DataFrame, so the DML operations can only be used on an actual table.
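
A minimal, self-contained sketch of that fallback behaviour, just to illustrate the idea (all names are hypothetical, not the actual implementation):

class DataFrame:
    def __init__(self, sql: str):
        self.sql = sql

    def select(self, *cols) -> "DataFrame":
        return DataFrame(f"SELECT {', '.join(cols)} FROM ({self.sql})")


class Table(DataFrame):
    def __init__(self, name: str):
        super().__init__(f"SELECT * FROM {name}")
        self.name = name

    def _to_dataframe(self) -> DataFrame:
        # Non-DML calls drop down to a plain DataFrame.
        return DataFrame(self.sql)

    def select(self, *cols) -> DataFrame:
        return self._to_dataframe().select(*cols)

    def delete(self, where: str) -> str:
        # DML calls stay on the Table because they need a concrete table name.
        return f"DELETE FROM {self.name} WHERE {where}"


employees = Table("employee")
print(type(employees.select("id", "age")))  # <class '...DataFrame'>
print(employees.delete("age > 100"))        # DELETE FROM employee WHERE age > 100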

@eakmanrq
Owner

Ok, that sounds interesting. So the thinking is that although session.read.table now returns a Table, it will maintain API compatibility by converting to a DataFrame if the following method call is not part of the Table API. I'm also open to creating new API endpoints for this (like session.sqlframe.table), but I see how your approach would make it feel more natively integrated.

eakmanrq added the enhancement (New feature or request) label on Dec 5, 2024
@zerodarkzone
Contributor Author

Hi,

I'm working on this on my fork. Could you take a look?

Creating a completely separate API is also possible.

zerodarkzone#3

@zerodarkzone
Contributor Author

This is how it works:

When you use the function session.read.table, it returns a Table object.

df_employee = session.read.table("employee")
print(type(df_employee))
<class 'sqlframe.databricks.table.DatabricksTable'>

When you call any other function (except for alias), it gets converted to a DataFrame.

print(type(df_employee.select("*")))
<class 'sqlframe.databricks.dataframe.DatabricksDataFrame'>

The new functions work something like this:
update

# Lazy operation
update = df_employee.update(
    _set={"age": df_employee["age"] + 1},
    where=df_employee["id"] == 1,
)
# Execute the update
update.execute()

delete

# Lazy operation
delete = df_employee.delete(
    where=df_employee["age"] > 100,
)
# Execute the delete
delete.execute()

merge

new_df = df_employee.where(F.col("age") > 40)

# Lazy operation
merge = df_employee.merge(
    new_df,
    (df_employee["id"] == new_df["id"]) & F.lit(True),
    [
        WhenMatched(condition=(df_employee["id"] == new_df["id"])).update(
            set_={"age": df_employee["age"] + new_df["age"] + new_df["age"] + 1}
        ),
        WhenNotMatched().insert(
            values={
                "id": new_df["id"],
                "fname": new_df["fname"],
                "lname": F.col("lname"),
                "age": new_df["age"] + 1,
                "store_id": F.col("store_id"),
                "active": F.lit(True),
            }
        ),
        WhenNotMatchedBySource().delete(),
    ],
)
# Execute the merge
merge.execute()
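
In case it helps reviewers, the merge clauses used above can be thought of as small builder objects, roughly in the spirit of delta-lake/Snowpark. The dataclasses below are a simplified, illustrative sketch only, not the actual classes in the fork:

from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class WhenMatched:
    condition: Optional[Any] = None
    update_set: Dict[str, Any] = field(default_factory=dict)

    def update(self, set_: Dict[str, Any]) -> "WhenMatched":
        self.update_set = set_
        return self


@dataclass
class WhenNotMatched:
    condition: Optional[Any] = None
    insert_values: Dict[str, Any] = field(default_factory=dict)

    def insert(self, values: Dict[str, Any]) -> "WhenNotMatched":
        self.insert_values = values
        return self


@dataclass
class WhenNotMatchedBySource:
    condition: Optional[Any] = None
    delete_row: bool = False

    def delete(self) -> "WhenNotMatchedBySource":
        self.delete_row = True
        return self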

@eakmanrq
Owner

Very cool. I really like how it maintains compatibility with the PySpark API while also creating a clean interface for doing these table operations. Nice job! 👍

I looked over your fork and overall it looks great. Will certainly dig more into the details once you are ready to create a PR. One thing I did notice is that it currently doesn't have any tests. Is that something you plan on doing before you submit the PR?

@zerodarkzone
Contributor Author

Hi,

Yes, adding some tests is something I plan to do.
