feat(io): support way to ensure read_* apis create physical tables in the backend
#9931
Comments
I think having separate methods makes sense for this, while keyword arguments clutter the API. How about
I think the best distinction we can draw between the approaches here is:
The former almost certainly requires physical storage (ruling out views), while the latter can be more or less whatever is well suited to the backend. If people need something specific, they have to call
If these APIs are identical except for how the backend loads the data (a binary option), I'd argue that a keyword argument makes more sense than duplicate methods which do the same thing for most backends. How about

```python
# cache=False is whatever the backend currently does, cache=True always uses a table
t = con.read_parquet(..., cache=True)
```
I hadn't considered the temporary/persistent choice, but I suppose that is another axis. Since our In my case, though, I don't really want a persistent table, but I do want the load to happen only once for the lifetime of the session. In this case a simple kwarg (that really only needs to be implemented by the few backends that use views by default) seems like the simpler option.
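A rough sketch of how the proposed flag could behave, written as a standalone helper rather than as a change to ibis itself. The helper name `read_parquet_cached` and its `cache` parameter are hypothetical. It assumes `con` is an ibis backend connection and relies on the table expression's `.cache()` method, which asks the backend to materialize the expression (typically as a temporary table).

```python
def read_parquet_cached(con, path, *, cache=False, **kwargs):
    """Hypothetical helper: read a Parquet file, optionally forcing a load.

    `con` is assumed to be an ibis backend connection.
    """
    # Default behavior: whatever the backend does (a view on some backends).
    t = con.read_parquet(path, **kwargs)
    # Assumption: Table.cache() materializes the data inside the backend,
    # so the files are only read once for the lifetime of the cached table.
    return t.cache() if cache else t
```

With `cache=False` this is just `con.read_parquet(...)`, so only callers who opt in pay for the extra load.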
After talking this through, I think this issue would be better resolved through fixing #6195, and documenting this pattern for backends where it's relevant (mostly just DuckDB). Closing.
Currently read_* methods (read_parquet/read_csv/...) may create a table or a view, depending on the backend. To the user executing code both should work, but the performance characteristics of each may differ depending on where the data actually lives. Likewise, a view may return different results over time if the source files are updated by something else, while reading into the backend (and persisting it there) would avoid this issue.
To work around this, I've written code like:
For backends that create a view this is equivalent (though it leaves a lingering view around). For other backends though this will duplicate a physical table (until #6195 is implemented), doubling the data stored.
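The workaround described above might look something like this sketch. The helper name `load_as_table` and its arguments are hypothetical, and `con` is assumed to be an ibis backend connection exposing `read_parquet` and `create_table`.

```python
def load_as_table(con, path, name):
    """Hypothetical helper: read a Parquet file and copy it into a physical table."""
    # May be a view or a table, depending on the backend.
    source = con.read_parquet(path)
    # Copy the data into a backend table so later queries
    # do not go back to the source files.
    return con.create_table(name, source)
```

On backends where `read_parquet` already loads a table, this stores the data twice, which is the drawback noted above.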
It would be nice to add a keyword argument to the read_* methods that allows forcing things to be a physical table.
For uniformity it might be nice to also have a way to force things to be a view, but:
Really, the distinction we're looking for is the semantics of "is the data loaded into the backend, or is it a view on where it sits at rest", not necessarily the DDL terms the backend uses for Table/View. The difference in polars between read_* and scan_*, for example.
Possible spellings: