Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): support rollup, cube and grouping_sets APIs #9945

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Aug 27, 2024

Description of changes

WIP PR implementing support for rollup, cube and grouping sets.

For background on what these are used for and how they work, see
https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS.

In discussing the implementation, I'm going to refer to groupings sets only,
but everything applies to rollups and cubes as well (rollups and cubes are
shorthand for a longer form GROUPING SET).

These concepts represent a new kind of "thing" for Ibis. They are neither
column nor scalar expressions that can be evaluated, nor are they tables
really.

The approach I took here was to add a non-Value operation for each construct.
These are desugared into tuples of tuples of Value expressions (because
multiple grouping sets specifications are allowed) wherever grouping sets are
supported.

One tricky thing is that I had to partition the key specifically requested in
group by, as distinct from those unique value expressions in grouping sets, to
support constructs like this:

SELECT a, b
FROM t
GROUP BY ROLLUP (a, b)

which is not equivalent to

SELECT a, b
FROM t
GROUP BY a, b, ROLLUP (a, b)

Right now, I'm looking for feedback on the approach before fleshing out the
test suite.

Issues closed

@cpcloud cpcloud added feature Features or general enhancements ux User experience related issues labels Aug 27, 2024
@cpcloud cpcloud marked this pull request as draft August 27, 2024 16:56
@gforsyth
Copy link
Member

This seems like a reasonable approach to me.
Small feedback that can always be addressed later:

I think it would be nice to allow passing in bare strings, values, or deferreds to grouping_sets, along with tuples of them, e.g.

ibis.grouping_sets(("a", "b"), "a", "b")

this would better mimic the SQL syntax (god, i can't believe I'm suggesting that...) and also avoids the awkward spelling of

ibis.grouping_sets(("a", "b"), ("a",), ("b",))

And on the naming front, I thing we should take a pointer from DuckDB and rename grouping to grouping_id -- I think grouping is going to cause confusion.

but overall seems solid!

@cpcloud
Copy link
Member Author

cpcloud commented Aug 28, 2024

All good points, definitely changing grouping -> grouping_id. I'm half tempted to call it group_id, those extra three characters are killing me.

@cpcloud
Copy link
Member Author

cpcloud commented Aug 28, 2024

For a full featured and correct implementation, grouping sets will depend on the earliest version of sqlglot contains tobymao/sqlglot#3985.

@cpcloud
Copy link
Member Author

cpcloud commented Aug 28, 2024

Ok, now that I think about this some more, I think having separate ibis.rollup/grouping_sets/cube objects adds unnecessary complication.

What do people think about making these keyword arguments to group_by? E.g.,

t.group_by(
    ...,
    rollup=(_.a, "b"),
    grouping_sets=("a", "b", ("a", "b")),
    cube=(_.c, _.d)
)

The separate objects would make sense if there were some way to use this functionality outside of a group by, but AFAICT they are strictly only valid in that case.

This would also eliminate having to do the partitioning, since the API requires the user spell it out.

@cpcloud
Copy link
Member Author

cpcloud commented Aug 28, 2024

This would break users who having columns named rollup, cube, or grouping_sets but that seems okay?

@jcrist
Copy link
Member

jcrist commented Aug 28, 2024

What do people think about making these keyword arguments to group_by?

This would break users who having columns named rollup, cube, or grouping_sets but that seems okay?

I slightly dislike that. Not because it might break existing users (seems unlikely), but because it's harder to visually distinguish keyword-arguments serving as aliases in group_by from those specific kwargs.

t.group_by(one=t.a, cube=(t.b,), three=t.c)  # intentionally terrible formatting

Another potential downside is it doesn't let users as easily use multiple of these constructs in the same query. Following this line from the docs:

If multiple grouping items are specified in a single GROUP BY clause, then the final list of grouping sets is the cross product of the individual items.

With separate objects, I'd to be able to pass in multiple of the same kind of object, while kwargs would force the user to handle the composition of them themselves (confession - I'm new to this whole concept, so apologies if this is a dumb example):

t.group_by(ibis.cube("a", "b"), ibis.cube("c", "d"))

Individual objects also may be easier for programmatic usage - a system would only need a single list of things to group by, rather than splitting out the various grouping set kind of things.

None of these are blockers of course, and all can be worked around. Just noting a few potential downsides.

@cpcloud
Copy link
Member Author

cpcloud commented Aug 28, 2024

Hm, all good points. I think I'll stick with the objects for now.

@cpcloud cpcloud force-pushed the rollup-cube-grouping-sets branch 4 times, most recently from 1719532 to 176adc8 Compare September 5, 2024 11:34
SELECT channel ,
id ,
SELECT nullif(channel, '') AS channel ,
nullif(id, '') AS id ,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change, and the same change in 80.sql are because of ClickHouse's obsession with never using null values. Apparently this includes the grouping keys for rollup and friends.

orderings: VarTuple[ops.SortKey] = ()
havings: VarTuple[ops.Value[dt.Boolean]] = ()

grouping_sets: VarTuple[VarTuple[VarTuple[ir.Value]]] = ()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me sad, and I'll add a comment.

ibis/expr/types/groupby.py Outdated Show resolved Hide resolved
return GroupedTable(self, groups)
groups, grouping_sets, rollups, cubes = partition_groups(*by, **key_exprs)
if not (groups or grouping_sets or rollups or cubes):
raise com.IbisInputError("No grouping keys provided")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we were using an Annotated type hint to enforce that groups was non-empty, but we can't do that anymore because it can be empty if the user provides at least one grouping set, rollup, or cube and we don't have a way to spell that kind of relationship in the type system (nor do I think we need to build that right now).

@@ -1040,6 +1069,9 @@ def aggregate(
metrics: Sequence[ir.Scalar] | None = (),
by: Sequence[ir.Value] | None = (),
having: Sequence[ir.BooleanValue] | None = (),
grouping_sets=(),
rollups=(),
cubes=(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like exposing these arguments in the aggregate. Here's why:

  1. I have to bind again, similar to what I am doing in the group_by method.
  2. Putting the grouping set arguments alongside by means there's potential for confusion and mixing: e.g., someone can passing ibis.cube into by, and/or they can pass it into cube. If we ultimately go this route then we'll have to handle that.

I suppose neither of these things is the actual worst, but I'm open to suggestions on how to avoid exposing these additional arguments.

Copy link
Member

@gforsyth gforsyth Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't get away from having to bind again (and actually adds a bit more work), but what if we just have all grouping expressions passed in to by?

agg(_.some_col.sum(), by=_.a, _.b, ibis.cube(_.a, _.b))

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll poke at it a bit. It might be easier if there were bound and unbound versions of the rollup/cube/grouping sets objects. Bound would be used internally only, while the user facing APIs are all unbound. The current ones are effectively Unbound.

@@ -1128,8 +1166,29 @@ def aggregate(
else:
metrics[metric.name] = metric

# construct the aggregate node
agg = ops.Aggregate(node, groups, metrics).to_expr()
keys = frozendict(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit is far too optimistic and very proof-of-concept. More unit tests are required to throw some chaos at this code.

gs = ibis.cube("a", "b")
expr = t.group_by(gs).agg(n=_.count())
result = ibis.to_sql(expr, dialect="duckdb")
assert len(result)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are woefully naive and optimistic, need to make them grumpier.

@cpcloud cpcloud requested review from jcrist and gforsyth September 5, 2024 12:52
@cpcloud
Copy link
Member Author

cpcloud commented Sep 5, 2024

Keeping this as draft until I can at least add some less sophomoric tests.

@cpcloud cpcloud force-pushed the rollup-cube-grouping-sets branch from 86d21fa to 441ec7a Compare September 5, 2024 15:03
@cpcloud cpcloud linked an issue Sep 11, 2024 that may be closed by this pull request
@cpcloud cpcloud force-pushed the rollup-cube-grouping-sets branch from 441ec7a to 1c0553c Compare September 19, 2024 09:23
@cpcloud cpcloud added this to the 10.0 milestone Sep 19, 2024
@cpcloud cpcloud marked this pull request as ready for review September 19, 2024 10:32
@gforsyth
Copy link
Member

I wish I hadn't learned this while working through reviewing this, but I guess it's valid to pass a ROLLUP or a CUBE as an element of a GROUPING SET? I don't think that works right now, and maybe we leave that for a follow-up, but I thought we should note it down for ourselves.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 20, 2024

Which of course means you can nest grouping sets arbitrarily.

D create table t (a int, b int);
D select 1 from t group by grouping sets (rollup (a, b), a, grouping sets (cube (a, b)));
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
│     1 │
└───────┘

@cpcloud
Copy link
Member Author

cpcloud commented Sep 20, 2024

This is a special level of hell.

@cpcloud cpcloud force-pushed the rollup-cube-grouping-sets branch 3 times, most recently from 91612a7 to 5f18749 Compare November 12, 2024 11:10
@github-actions github-actions bot added tests Issues or PRs related to tests sql Backends that generate SQL labels Nov 12, 2024
@cpcloud cpcloud force-pushed the rollup-cube-grouping-sets branch 4 times, most recently from 394b5e8 to bf14a1e Compare November 14, 2024 12:52
@cpcloud cpcloud force-pushed the rollup-cube-grouping-sets branch from bf14a1e to e7e11a6 Compare December 20, 2024 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements sql Backends that generate SQL tests Issues or PRs related to tests ux User experience related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: grouping sets functionality
3 participants