From dc3c9fb77367dd0e42c0647685b9e936a342c2b2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ingo=20M=C3=BCller?= Date: Fri, 13 Sep 2024 08:44:22 +0000 Subject: [PATCH 1/4] [WIP] feat: change grouping expressions in AggregateRel to references. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit BREAKING CHANGE: This PR changes the definition of grouping sets in `AggregateRel` to consist of references into a list of grouping expressions instead of consisting of expressions directly. As discussed in more detail in #700, the previous version is problematic because it requires consumers to deduplicate these expressions, which, in turn, requires to parse and understand 100% of these expression even in cases where that understanding is otherwise optional. The new version avoids that problem and, thus, allows consumers to be simpler. Signed-off-by: Ingo Müller --- proto/substrait/algebra.proto | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/proto/substrait/algebra.proto b/proto/substrait/algebra.proto index eed7bcc79..847114d49 100644 --- a/proto/substrait/algebra.proto +++ b/proto/substrait/algebra.proto @@ -244,18 +244,30 @@ message AggregateRel { // Input of the aggregation Rel input = 2; - // A list of one or more grouping expression sets that the aggregation measures should be calculated for. - // Required if there are no measures. + // A list of zero or more grouping sets that the aggregation measures should + // be calculated for. There must be at least one grouping set if there are no + // measures (but it can be the empty grouping set). repeated Grouping groupings = 3; // A list of one or more aggregate expressions along with an optional filter. // Required if there are no groupings. repeated Measure measures = 4; + // A list of zero or more grouping expressions that grouping sets (i.e., + // `Grouping` messages in the `groupings` field) can reference. Each + // expression in this list must be referred to by at least one + // `Grouping.expression_reference`. + repeated Expression grouping_expressions = 5; + substrait.extensions.AdvancedExtension advanced_extension = 10; message Grouping { - repeated Expression grouping_expressions = 1; + // Deprecated in favor of `expression_reference` below. + repeated Expression grouping_expressions = 1 [deprecated = true]; + + // A list of zero or more references to grouping expressions, i.e., indices + // into the `grouping_expression` list. + repeated uint32 expression_reference = 2; } message Measure { From 5a77e22379fcbb8582ee0ec0e66b943c1cd87c33 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ingo=20M=C3=BCller?= Date: Fri, 13 Sep 2024 13:58:31 +0000 Subject: [PATCH 2/4] Make `grouping_expressions` plural. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This is in line with other repeated fields and was suggested by @tokoko. Signed-off-by: Ingo Müller --- proto/substrait/algebra.proto | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/proto/substrait/algebra.proto b/proto/substrait/algebra.proto index 847114d49..5c4b9c8ac 100644 --- a/proto/substrait/algebra.proto +++ b/proto/substrait/algebra.proto @@ -256,18 +256,18 @@ message AggregateRel { // A list of zero or more grouping expressions that grouping sets (i.e., // `Grouping` messages in the `groupings` field) can reference. Each // expression in this list must be referred to by at least one - // `Grouping.expression_reference`. + // `Grouping.expression_references`. repeated Expression grouping_expressions = 5; substrait.extensions.AdvancedExtension advanced_extension = 10; message Grouping { - // Deprecated in favor of `expression_reference` below. + // Deprecated in favor of `expression_references` below. repeated Expression grouping_expressions = 1 [deprecated = true]; // A list of zero or more references to grouping expressions, i.e., indices // into the `grouping_expression` list. - repeated uint32 expression_reference = 2; + repeated uint32 expression_references = 2; } message Measure { From b7828141aa1b1d95a903ee43fd740eeaf76142cb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ingo=20M=C3=BCller?= Date: Mon, 23 Sep 2024 08:23:15 +0000 Subject: [PATCH 3/4] docs: reflect changes in textual specification/web site MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adapt the textual specification of `AgregateRel` in the specification, which appears on the web site. In particular: * Introduce the list of grouping expressions and define grouping sets as references into that list. * Specify that each grouping expression must be referred to by at least one grouping set. * Remove the mentioning of expression equality, which is now not needed anymore. The commit also makes two non-semantic changes in the same paragraphs that aim to clarify the text: * Rephrase the explanation on when two records are folded into the same group to be more clear. * Mention explicitly that one empty grouping set is equivalent to not having any grouping set. Signed-off-by: Ingo Müller --- site/docs/relations/logical_relations.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/site/docs/relations/logical_relations.md b/site/docs/relations/logical_relations.md index d71b57d2f..9a51117c5 100644 --- a/site/docs/relations/logical_relations.md +++ b/site/docs/relations/logical_relations.md @@ -345,13 +345,13 @@ The aggregate operation groups input data on one or more sets of grouping keys, | Inputs | 1 | | Outputs | 1 | | Property Maintenance | Maintains distribution if all distribution fields are contained in every grouping set. No orderedness guaranteed. | -| Direct Output Order | The list of distinct columns from each grouping set (ordered by their first appearance) followed by the list of measures in declaration order, followed by an `i32` describing the associated particular grouping set the value is derived from (if applicable). | +| Direct Output Order | The list of grouping expressions in declaration order followed by the list of measures in declaration order, followed by an `i32` describing the associated particular grouping set the value is derived from (if applicable). | In its simplest form, an aggregation has only measures. In this case, all records are folded into one, and a column is returned for each aggregate expression in the measures list. -Grouping sets can be used for finer-grained control over which records are folded. Within a grouping set, two records will be folded together if and only if each expressions in the grouping set yields the same value for each. The values returned by the grouping sets will be returned as columns to the left of the columns for the aggregate expressions. If a grouping set contains no grouping expressions, all rows will be folded for that grouping set. +Grouping sets can be used for finer-grained control over which records are folded. A grouping set consists of zero or more references to the list of grouping expressions. Within a grouping set, two records will be folded together if and only if they have the same values for each of the expressions in the grouping set. The values returned by the grouping expressions will be returned as columns to the left of the columns for the aggregate expressions. Each of the grouping expressions must occur in at least one of the grouping sets. If a grouping set contains no grouping expressions, all rows will be folded for that grouping set. (Having a single grouping set with no grouping expressions is thus equivalent to not haveing any grouping sets.) -It's possible to specify multiple grouping sets in a single aggregate operation. The grouping sets behave more or less independently, with each returned record belonging to one of the grouping sets. The values for the grouping expression columns that are not part of the grouping set for a particular record will be set to null. Two grouping expressions will be returned using the same column if the protobuf messages describing the expressions are equal. The columns for grouping expressions that do *not* appear in *all* grouping sets will be nullable (regardless of the nullability of the type returned by the grouping expression) to accomodate the null insertion. +It is possible to specify multiple grouping sets in a single aggregate operation. The grouping sets behave more or less independently, with each returned record belonging to one of the grouping sets. The values for the grouping expression columns that are not part of the grouping set for a particular record will be set to null. The columns for grouping expressions that do *not* appear in *all* grouping sets will be nullable (regardless of the nullability of the type returned by the grouping expression) to accomodate the null insertion. To further disambiguate which record belongs to which grouping set, an aggregate relation with more than one grouping set receives an extra `i32` column on the right-hand side. The value of this field will be the zero-based index of the grouping set that yielded the record. From 9e0e32003178794377ed0242388a8dc0d5c4b2c9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ingo=20M=C3=BCller?= Date: Fri, 27 Sep 2024 09:51:42 +0200 Subject: [PATCH 4/4] Fix typo found by @westonpace Co-authored-by: Weston Pace --- site/docs/relations/logical_relations.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/docs/relations/logical_relations.md b/site/docs/relations/logical_relations.md index 9a51117c5..4d6a1f9d3 100644 --- a/site/docs/relations/logical_relations.md +++ b/site/docs/relations/logical_relations.md @@ -349,7 +349,7 @@ The aggregate operation groups input data on one or more sets of grouping keys, In its simplest form, an aggregation has only measures. In this case, all records are folded into one, and a column is returned for each aggregate expression in the measures list. -Grouping sets can be used for finer-grained control over which records are folded. A grouping set consists of zero or more references to the list of grouping expressions. Within a grouping set, two records will be folded together if and only if they have the same values for each of the expressions in the grouping set. The values returned by the grouping expressions will be returned as columns to the left of the columns for the aggregate expressions. Each of the grouping expressions must occur in at least one of the grouping sets. If a grouping set contains no grouping expressions, all rows will be folded for that grouping set. (Having a single grouping set with no grouping expressions is thus equivalent to not haveing any grouping sets.) +Grouping sets can be used for finer-grained control over which records are folded. A grouping set consists of zero or more references to the list of grouping expressions. Within a grouping set, two records will be folded together if and only if they have the same values for each of the expressions in the grouping set. The values returned by the grouping expressions will be returned as columns to the left of the columns for the aggregate expressions. Each of the grouping expressions must occur in at least one of the grouping sets. If a grouping set contains no grouping expressions, all rows will be folded for that grouping set. (Having a single grouping set with no grouping expressions is thus equivalent to not having any grouping sets.) It is possible to specify multiple grouping sets in a single aggregate operation. The grouping sets behave more or less independently, with each returned record belonging to one of the grouping sets. The values for the grouping expression columns that are not part of the grouping set for a particular record will be set to null. The columns for grouping expressions that do *not* appear in *all* grouping sets will be nullable (regardless of the nullability of the type returned by the grouping expression) to accomodate the null insertion.