feat(io): support way to ensure `read_*` apis create physical tables in the backend #9931

jcrist · 2024-08-26T17:59:37Z

Currently the read_* methods (read_parquet/read_csv/...) may create either a table or a view, depending on the backend. To the user executing code both should work, but the performance characteristics of each may differ depending on where the data actually lives. Likewise, a view may produce different results over time if the source files are updated by something else, while reading the data into the backend (and persisting it there) would avoid this issue.
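To make the "different results over time" point concrete, here is a minimal sketch using the standard-library sqlite3 module rather than ibis. It is only an analogy: the `source` table stands in for files at rest, and the names `source`, `as_view`, and `as_table` are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# `source` stands in for the files at rest (an assumption of this sketch).
con.execute("CREATE TABLE source (x INTEGER)")
con.executemany("INSERT INTO source VALUES (?)", [(1,), (2,)])

# A view re-reads `source` on every query; a table is a snapshot taken now.
con.execute("CREATE VIEW as_view AS SELECT * FROM source")
con.execute("CREATE TABLE as_table AS SELECT * FROM source")

# Something else updates the source after the "read".
con.execute("INSERT INTO source VALUES (3)")

view_count = con.execute("SELECT COUNT(*) FROM as_view").fetchone()[0]
table_count = con.execute("SELECT COUNT(*) FROM as_table").fetchone()[0]
print(view_count, table_count)  # the view sees the new row, the table does not
```

The view picks up the later insert while the physical table keeps the data as it was at load time, which is the stability I'm after here.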

To work around this, I've written code like:

t = ibis.read_parquet(...).cache()  # force doing the read now

For backends that create a view, this is equivalent (though it leaves a lingering view around). For other backends, though, this will duplicate a physical table (until #6195 is implemented), doubling the data stored.

It would be nice to add a keyword argument to the read_* methods that allows forcing things to be a physical table.

For uniformity it might be nice to also have a way to force things to be a view, but:

- not all backends support views, while all have a table concept
- I personally don't have a use case for requiring that this generates a view instead of a table.

Really the distinction we're looking for is semantics of "is the data loaded into the backend, or is it a view on where it sits at rest", not necessarily the DDL terms the backend uses for Table/View. The difference in polars between read_* vs scan_*, for example.

Possible spellings:

# True means table, False means view (or something else), None is backend default?
# or True means table, False means backend default (maybe table, maybe something else?)
t = con.read_parquet(..., as_table=True)  # or `table=True`, `use_table=True`, `force_table=True`?

# Force a table if False, otherwise :shrug:.
# kinda dislike this one, but it is more precise
t = con.read_parquet(..., maybe_view=False)  # or `allow_view=False`?

t = con.read_parquet(..., type="table")  # or `type="view"`. I hate this one


cpcloud · 2024-08-26T19:35:28Z

I think having separate methods makes sense for this, while keyword arguments clutter the API.

How about load_* equivalents that unconditionally persist data and mirror the read_* APIs?

cpcloud · 2024-08-26T19:38:26Z

I think the best distinction we can draw between any of the approaches here is:

- Things persist after the session ends
- Things do not persist after the session ends

The former almost certainly requires physical storage (ruling out views), while the latter can be more or less whatever's well-suited to the backend.

If people need something specific, they have to call create_{view,table}

jcrist · 2024-08-26T20:23:37Z

> I think having separate methods makes sense for this, while keyword arguments clutter the API.

If these APIs are identical except for how the backend loads the data (a binary option), I'd argue that a keyword argument makes more sense than duplicate methods which do the same thing for most backends.

How about

# cache=False is whatever the backend currently does, cache=True always uses a table
t = con.read_parquet(..., cache=True)

> How about load_* equivalents that unconditionally persist data and mirror the read_* APIs?

I hadn't considered the temporary/persistent choice, but I suppose that is another axis. Since our read_* methods generate a unique name by default (and anything that persists permanently should probably have an explicit name), I'm not sure if a binary modality here would make sense, and agree a new load_* method might make more sense (so we could better indicate and enforce that table_name is required).

In my case though, I don't really want a persistent table, but do want the load to happen only once for the lifetime of the session. In this case a simple kwarg (that really only needs to be implemented by the few backends that use views by default) seems like the simpler option.
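Until something like this exists, the kwarg behavior can be approximated in user code. The sketch below is a hypothetical helper, not part of ibis: the name `read_once` and its `cache` parameter are made up, and it assumes only that the object returned by the reader has a `.cache()` method, as ibis tables do.

```python
def read_once(reader, *args, cache=False, **kwargs):
    """Call a read_* method and optionally force the load to happen now.

    Hypothetical helper, not an ibis API. `reader` is any bound read_*
    method (for example `con.read_parquet`); the returned object is
    assumed to expose `.cache()`.
    """
    table = reader(*args, **kwargs)
    # cache=False keeps whatever the backend does by default;
    # cache=True materializes the result so later queries reuse it.
    return table.cache() if cache else table
```

With a connected backend `con` this would be called as `read_once(con.read_parquet, ..., cache=True)`. It has the same caveat as the workaround above: on backends that already load into a table, the data is duplicated until #6195 is implemented.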

jcrist · 2024-08-28T17:28:14Z

After talking this through, I think this issue would be better resolved through fixing #6195, and documenting this pattern for backends where it's relevant (mostly just duckdb). Closing.

github-project-automation bot added this to Ibis planning and roadmap Aug 26, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Aug 26, 2024

jcrist added the ddl, io, feature, and ux labels Aug 26, 2024

jcrist mentioned this issue Aug 26, 2024

feat: make Table.cache() a no-op for tables that are already concrete in a backend #6195

Open

jcrist closed this as completed Aug 28, 2024

github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Aug 28, 2024

jcrist mentioned this issue Aug 30, 2024

feat(api): avoid caching physical tables #9976

Open
