Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(io): support way to ensure read_* apis create physical tables in the backend #9931

Closed
jcrist opened this issue Aug 26, 2024 · 4 comments
Labels
ddl Issues related to creating or altering data definitions feature Features or general enhancements io Issues related to input and/or output ux User experience related issues

Comments

@jcrist
Copy link
Member

jcrist commented Aug 26, 2024

Currently read_* methods (read_parquet/read_csv/...) may create a table or view, backend dependent. To the user executing code both should work, but the performance characteristics of each may differ depending on where the data actually lives. Likewise, a view may result in different results over time if the source files are updated by something else, while reading into the backend (and persisting it there) would avoid this issue.

To work around this, I've written code like:

t = ibis.read_parquet(...).cache()  # force doing the read now

For backends that create a view this is equivalent (though it leaves a lingering view around). For other backends though this will duplicate a physical table (until #6195 is implemented), doubling the data stored.

It would be nice to add a keyword argument to the read_* methods that allows forcing things to be a physical table.

For uniformity it might be nice to also have a way to force things to be a view, but:

  • not all backends support views, while all have a table concept
  • I personally don't have a use case for requiring that this generates a view instead of a table.

Really the distinction we're looking for is semantics of "is the data loaded into the backend, or is it a view on where it sits at rest", not necessarily the DDL terms the backend uses for Table/View. The difference in polars between read_* vs scan_*, for example.

Possible spellings:

# True means table, False means view (or something else), None is backend default?
# or True means table, False means backend default (maybe table, maybe something else?)
t = con.read_parquet(..., as_table=True)  # or `table=True`, `use_table=True`, `force_table=True`?

# Force a table if False, otherwise :shrug:.
# kinda dislike this one, but it is more precise
t = con.read_parquet(..., maybe_view=False)  # or `allow_view=False`?

t = con.read_parquet(..., type="table")  # or `type="view"`. I hate this one
@jcrist jcrist added ddl Issues related to creating or altering data definitions io Issues related to input and/or output feature Features or general enhancements ux User experience related issues labels Aug 26, 2024
@cpcloud
Copy link
Member

cpcloud commented Aug 26, 2024

I think having separate methods makes sense for this, while keyword arguments clutter the API.

How about load_* equivalents that unconditionally persist data and mirror the read_* APIs?

@cpcloud
Copy link
Member

cpcloud commented Aug 26, 2024

I think best distinction we can do between any approach here is:

  1. Things persist after the session ends
  2. Things do not persist after the session ends

The former almost certainly requires physical storage (ruling out views), while the latter can be more less or whatever's well-suited to backend.

If people need something specific, they have to call create_{view,table}

@jcrist
Copy link
Member Author

jcrist commented Aug 26, 2024

I think having separate methods makes sense for this, while keyword arguments clutter the API.

If these APIs are identical except for how the backend loads the data (a binary option), I'd argue that a keyword argument makes more sense than duplicate methods which do the same thing for most backends.

How about

# cache=False is whatever the backend currently does, cache=True always uses a table
t = con.read_parquet(..., cache=True)

How about load_* equivalents that unconditionally persist data and mirror the read_* APIs?

I hadn't considered the temporary/persistent choice, but I suppose that is another axis. Since our read_* methods generate a unique name by default (and anything that persists permanently should probably have an explicit name), I'm not sure if a binary modality here would make sense, and agree a new load_* method might make more sense (so we could better indicate and enforce that table_name is required).

In my case though, I don't really want a persistent table, but do want the load to happen only once for the lifetime of the session. In this case a simple kwarg (that really only needs to be implemented by the few backends that use views by default) seems like the simpler option.

@jcrist
Copy link
Member Author

jcrist commented Aug 28, 2024

After talking this through, I think this issue would be better resolved through fixing #6195, and documenting this pattern for backends where it's relevant (mostly just duckdb). Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ddl Issues related to creating or altering data definitions feature Features or general enhancements io Issues related to input and/or output ux User experience related issues
Projects
Archived in project
Development

No branches or pull requests

4 participants
@cpcloud @jcrist and others