-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: change grouping expressions in AggregateRel to references #706
feat: change grouping expressions in AggregateRel to references #706
Conversation
ACTION NEEDED Substrait follows the Conventional Commits The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
64e9e71
to
edc8487
Compare
This is still work in progress. The text part of the specification still needs to be adapted. But I guess the protobuf definition allows to agree on what our desired end goal is? |
ea3f49d
to
8f58598
Compare
I'm supportive of this change although it makes my stomach a little sick. Thoughts @westonpace @EpsilonPrime @vbarua ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need a description of why and how to use the grouping expressions on the website but the approach is soundly backwards compatible --fix newer consumers will do the right thing for older plans. Producers may duplicate the old grouping expressions with the references to work with older consumers too.
I'm +1 on the change but I think the release note/commit message needs to better clarify what needs to separate what was done versus the breaking change part. In the breaking change part we should be describing what specifically needs to be done for migration and what is actually breaking. |
Thanks, all, for being supporting and providing feedback! I have added a new commit that propagates the changes to the textual specification and changed/extended the commit message. Please let me know if those changes reflect what you had in mind. |
BREAKING CHANGE: This PR changes the definition of grouping sets in `AggregateRel` to consist of references into a list of grouping expressions instead of consisting of expressions directly. As discussed in more detail in substrait-io#700, the previous version is problematic because it requires consumers to deduplicate these expressions, which, in turn, requires to parse and understand 100% of these expression even in cases where that understanding is otherwise optional. The new version avoids that problem and, thus, allows consumers to be simpler. Signed-off-by: Ingo Müller <[email protected]>
This is in line with other repeated fields and was suggested by @tokoko. Signed-off-by: Ingo Müller <[email protected]>
Adapt the textual specification of `AgregateRel` in the specification, which appears on the web site. In particular: * Introduce the list of grouping expressions and define grouping sets as references into that list. * Specify that each grouping expression must be referred to by at least one grouping set. * Remove the mentioning of expression equality, which is now not needed anymore. The commit also makes two non-semantic changes in the same paragraphs that aim to clarify the text: * Rephrase the explanation on when two records are folded into the same group to be more clear. * Mention explicitly that one empty grouping set is equivalent to not having any grouping set. Signed-off-by: Ingo Müller <[email protected]>
9ada3c8
to
b782814
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. While this does add a little complexity, I think it removes a lot of ambiguity and avoids the need to specify expression equality which IMO is much harder.
@@ -345,13 +345,13 @@ The aggregate operation groups input data on one or more sets of grouping keys, | |||
| Inputs | 1 | | |||
| Outputs | 1 | | |||
| Property Maintenance | Maintains distribution if all distribution fields are contained in every grouping set. No orderedness guaranteed. | | |||
| Direct Output Order | The list of distinct columns from each grouping set (ordered by their first appearance) followed by the list of measures in declaration order, followed by an `i32` describing the associated particular grouping set the value is derived from (if applicable). | | |||
| Direct Output Order | The list of grouping expressions in declaration order followed by the list of measures in declaration order, followed by an `i32` describing the associated particular grouping set the value is derived from (if applicable). | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An added benefit of this is that it makes it much easier to compute the output order. Declaration order is very clear ✨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might actually be easier to understand than the original formulation (though maybe it's just because I know more now)
Co-authored-by: Weston Pace <[email protected]>
9e0e320
(I accepted @westonpace's suggestion to fix a typo -- now I need a new approval to be able to merge 😉) |
Given that we had at least 2 +1's before fixing the minor typo plus my vote I think we have 3 +1 votes and am going to merge. Will revert if someone feels I misinterpreted everyone's intent. |
Thanks for your patience here @ingomueller-net , really appreciate the contribution and updates. |
Awesome, great to see this being wrapped up. Thank you all for the good discussions! |
BREAKING CHANGE: This PR changes the definition of grouping sets in
AggregateRel
to consist of references into a list of grouping expressions instead of consisting of expressions directly.With the previous definition, consumers had to deduplicate the expressions in the grouping sets in order to execute the query or even derive the output schema (which is problematic, as explained below). With this change, the responsibility of deduplicating expressions is now on the producer. Concretely, consumers are now expected to be simpler: The list of grouping expressions immediately provides the information needed to derive the output schema and the list of grouping sets explicitly and unambiguously provides the equality of grouping expressions. Producers now have to specify the grouping sets explicitly. If their internal representation of grouping sets consists of full grouping expressions (rather than references), then they must deduplicate these expressions according to their internal notion of expression equality in order to produce grouping sets consisting of references to these deduplicated expressions.
If the previous format is desired, it can be obtained from the new format by (1) deduplicating the grouping expressions (according to the previously applicable definition of expression equality), (2) re-establishing the duplicates using the emit clause, and (3) "dereferencing" the references in the grouping sets, i.e., by replacing each reference in the grouping sets with the expression it refers to.
The previous version was problematic because it required the consumers to deduplicate the expressions from the grouping sets. This, in turn, requires to parse and understand 100% of these expression even in cases where that understanding is otherwise optional, which is in opposition to the general philosophy of allowing for simple-minded consumers. The new version avoids that problem and, thus, allows consumers to be simpler. The issues are discussed in even more detail in #700.