Case Study to design logical TableFunction operator #753

Open
drin opened this issue Dec 6, 2024 · 1 comment

drin commented Dec 6, 2024

I am interested in helping design the spec for TableFunction in Substrait, and I'm hoping to track my own development here in a way that provides a reference for discussion. As I mentioned in #745 (comment), I eventually plan to implement 3 different types of table functions and explore what feels natural and what feels like an antipattern. If anyone already has strong opinions, feel free to share; I have very little experience.

The 3 types of table functions:

  1. A TableFunction acting as a leaf operator, e.g. scan_arrow_ipc in DuckDB.
  2. A TableFunction acting as a transformation of an input table to an output table. For this, I want to implement something a bit unique: a function that maintains the cardinality of the table (output rows == input rows) but applies a function across the columns, thereby changing the schema (like a projection), e.g. [, , , ..., ] -> [, ].
  3. A TableFunction acting like a fused operator, e.g. "GroupJoin" as in Accelerating Queries with... Join by GroupJoin.
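To make type 2 concrete, here is a minimal sketch in plain Python rather than Substrait or DuckDB. The `Table` layout, `column_transform`, and the `row_min`/`row_max` outputs are all hypothetical names chosen for illustration. It only shows the shape of the idea: one output row per input row, with a different output schema.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical in-memory layout: column name -> list of values.
Table = Dict[str, List[float]]


def column_transform(
    table: Table,
    out_schema: Dict[str, Callable[[Sequence[float]], float]],
) -> Table:
    """Apply each output function across the columns of every input row.

    The result has exactly one row per input row, but its columns are
    the keys of `out_schema`, not the input column names.
    """
    names = list(table)
    n_rows = len(table[names[0]]) if names else 0
    out: Table = {name: [] for name in out_schema}
    for i in range(n_rows):
        row = [table[c][i] for c in names]  # all column values of row i
        for name, fn in out_schema.items():
            out[name].append(fn(row))
    return out


table = {"a": [1.0, 4.0], "b": [2.0, 5.0], "c": [3.0, 6.0]}
result = column_transform(table, {"row_min": min, "row_max": max})
```

Here three input columns become two output columns while the row count stays at 2.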

I'll add more information to this issue as I prototype. If anyone has recommendations on a different way to track or reference the work, let me know and I can adjust.

jacques-n commented Dec 10, 2024

I think about it more like two distinct behaviors as opposed to three.

A. Generator table function: takes in only constant arguments and produces 0..N records. Operates as a leaf in a tree.
B. Set-based table function: takes in a set of records and adds one or more additional columns to each.

For your type 2, I think of that as a window function which possibly excludes certain (or all) input columns. I guess the one distinction is you want a window function that returns multiple output values...

Type A (your type 1) feels pretty simple, and I think we already have most of the low-level concepts to build against. I could see:

  • Add a new table function extension type.
  • Declare input arguments for table function as a collection of constant arguments.
  • Define the output to be a collection of output columns/fields rather than a single column.
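
As a rough illustration of those three bullets, a declaration for a Type A function might carry a name, a list of constant argument types, and a list of output fields. This is a sketch in plain Python, not the Substrait extension format; `GeneratorTableFunction`, `generate_squares`, and the type strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, Tuple


@dataclass(frozen=True)
class GeneratorTableFunction:
    """Hypothetical declaration of a leaf (generator) table function."""

    name: str
    arg_types: Tuple[str, ...]                  # constant arguments only
    output_fields: Tuple[Tuple[str, str], ...]  # (field name, type) pairs
    impl: Callable[..., Iterator[Tuple[Any, ...]]]

    def invoke(self, *const_args: Any) -> Iterator[Dict[str, Any]]:
        """Produce 0..N records, each with every declared output field."""
        if len(const_args) != len(self.arg_types):
            raise TypeError(
                f"{self.name} expects {len(self.arg_types)} constant arguments"
            )
        names = [field_name for field_name, _ in self.output_fields]
        for record in self.impl(*const_args):
            yield dict(zip(names, record))


def _squares(start: int, stop: int) -> Iterator[Tuple[int, int]]:
    for i in range(start, stop):
        yield (i, i * i)


generate_squares = GeneratorTableFunction(
    name="generate_squares",
    arg_types=("i64", "i64"),
    output_fields=(("n", "i64"), ("n_squared", "i64")),
    impl=_squares,
)

rows = list(generate_squares.invoke(0, 3))
```

The key difference from a scalar function declaration is that `output_fields` is a collection, so each produced record is a full row rather than a single value.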

For Type B (which likely matches your type 3), we likely need to introduce a lateral operator or similar. Can you remind me how tools like Calcite represent this? The lateral would accept a table function extension type whose arguments could be constants or field references. The first use would probably be FLATTEN/UNNEST.
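
A minimal sketch of the lateral idea with UNNEST as the first use, in plain Python and under my own assumptions: `lateral`, `unnest`, and the `orders` data are hypothetical, arguments are field references only (constants are omitted for brevity), and input rows whose table function yields nothing are dropped, as in an inner join.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def unnest(values: Iterable[Any]) -> Iterator[Row]:
    """Table function: one output record per element of a list value."""
    for value in values:
        yield {"item": value}


def lateral(
    input_rows: Iterable[Row],
    table_fn: Callable[..., Iterator[Row]],
    arg_fields: List[str],
) -> Iterator[Row]:
    """Invoke `table_fn` once per input row, passing the referenced fields.

    Each produced record is appended to the input row that produced it.
    """
    for row in input_rows:
        args = [row[field] for field in arg_fields]  # resolve field references
        for produced in table_fn(*args):
            yield {**row, **produced}


orders = [
    {"order_id": 1, "tags": ["a", "b"]},
    {"order_id": 2, "tags": []},
    {"order_id": 3, "tags": ["c"]},
]

flattened = list(lateral(orders, unnest, ["tags"]))
```

In this sketch order 1 produces two rows, order 2 produces none, and order 3 produces one. Whether the real operator should keep rows with empty output (outer semantics) is a design question the sketch does not settle.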
