-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Precise definition of equality of grouping columns in the aggregate relation #700
Comments
One solution is to come up with a more precise definition of expressions equality. That may be the best trade-off eventually. However, I have the feeling that that is relatively brittle and can be difficult to implement (at least for consumers, see below). I thus proposed an alternative design to the aggregate relation on Slack that I think we should consider. The idea is to avoid repeating equal expressions altogether and express the grouping sets as references into a list of grouping expressions. The relation would then look something like this: message AggregateRel {
// ...
// A list of one or more grouping expression sets that the aggregation measures should be calculated for.
// Required if there are no measures.
repeated Grouping groupings = 3;
// The grouping expressions that the grouping sets refer to.
repeated Expression grouping_expressions = 5; // <--- this is new
// ...
message Grouping {
// References into `AggregateRel.grouping_expressions`.
repeated int32 grouping_expressions = 1; // <--- this is now a reference instead of an `Expression`
}
// ...
} This avoids the definition of equality and, as a minor added benefit, compresses the plan in case of multiple grouping sets by avoiding redundant copies of grouping expressions. However, there are also a couple of downsides:
|
This was probably from a comment I made on Slack. I made the comment largely because most implementers start with a single grouping set (the most common case). Additionally, a lot of newer implementers might not even be aware of the concept of grouping sets. Having them normalize expressions in this way would be surprising for them. I don't think that is a showstopper but I thought it worth noting. |
I agree we need to be more specific on this. I don't think we need to have the most generous definition of equality. So to concretely answer your questions:
Maybe this "simple" definition of equality may be enough for the next several years?... |
One argument that came up on Slack (which seems to go into the direction of @jacques-n's comment) is that Substraits high-level philosophy is to "allow for dumb consumers" (see this slide deck). In the context of the current issue, one could argue that sparing consumers to deduplicate expressions would allow to make them simpler, and that, if we have to decide whether producers or consumers need to do that, we'd rather give the producers this responsibility. |
This may be more complex than it appears on a first look. A first (solvable) issue is the fact that part of message AdvancedExtension {
// An optimization is helpful information that don't influence semantics. May
// be ignored by a consumer.
repeated google.protobuf.Any optimization = 1;
// ...
} The more complex issue appears here and in quite a number of other places I hadn't previously thought of: how do we compare two I looked a bit into I hate to be a pain in the neck here but I hope it is in everybody's interest to bring up this issue now rather than running into incompatibilities later on, when the spec may be widely adopted and considered final. |
I thought about this a bit more. The gravity of this problem is somewhat lower than I thought because, in many cases where
However, there are a few instances where it would actually be nice if consumers not understanding the
TL;DR: We can probably classify each of these into "can be ignored" and "cannot be ignored" but I maintain my argument that this makes consumers more complex than necessary. |
Yeah, good point. One of the painful things here is that I don't see any clean way to support backwards compatibility if we want to move to the new representation. We might as well just introduce AggregatelRel2 or something as repeated fields don't work well with oneofs. (It begs the question of whether moving forward we might want to always wrap repeated fields in a message to avoid this kind of compatibility pain.) I think a lot of systems have implemented AggregatelRel already so I don't think we can do a hard breaking changes here like we might be able to do on edge cases that are rarely implemented. |
We could repurpose groupings to be the ordered list of grouping expressions that we refer to and add a GroupingDetail? message that (if present) is the new grouping information with indexes to the expressions. It would be ugly to require that the groupings only contain one expression though. |
It'd be awkward since groupings is a two dimensional array of expressions and we need a one dimensional array. |
This PR changes the definition of grouping sets in `AggregateRel` to consist of references into a list of grouping expressions instead of consisting of expressions directly. As discussed in more detail in substrait-io#700, the previous version is problematic because it requires consumers to deduplicate these expressions, which, in turn, requires to parse and understand 100% of these expression even in cases where that understanding is otherwise optional. The new version avoids that problem and, thus, allows consumers to be simpler. Signed-off-by: Ingo Müller <[email protected]>
BREAKING CHANGE: This PR changes the definition of grouping sets in `AggregateRel` to consist of references into a list of grouping expressions instead of consisting of expressions directly. As discussed in more detail in substrait-io#700, the previous version is problematic because it requires consumers to deduplicate these expressions, which, in turn, requires to parse and understand 100% of these expression even in cases where that understanding is otherwise optional. The new version avoids that problem and, thus, allows consumers to be simpler. Signed-off-by: Ingo Müller <[email protected]>
There seems to be an general openess to following my proposal -- if there are still arguments against it please let me know. I thus went ahead and made a concrete proposal: #706. I think it avoids the problem that @EpsilonPrime and @jacques-n have been discussing. |
BREAKING CHANGE: This PR changes the definition of grouping sets in `AggregateRel` to consist of references into a list of grouping expressions instead of consisting of expressions directly. As discussed in more detail in substrait-io#700, the previous version is problematic because it requires consumers to deduplicate these expressions, which, in turn, requires to parse and understand 100% of these expression even in cases where that understanding is otherwise optional. The new version avoids that problem and, thus, allows consumers to be simpler. Signed-off-by: Ingo Müller <[email protected]>
BREAKING CHANGE: This PR changes the definition of grouping sets in `AggregateRel` to consist of references into a list of grouping expressions instead of consisting of expressions directly. As discussed in more detail in substrait-io#700, the previous version is problematic because it requires consumers to deduplicate these expressions, which, in turn, requires to parse and understand 100% of these expression even in cases where that understanding is otherwise optional. The new version avoids that problem and, thus, allows consumers to be simpler. Signed-off-by: Ingo Müller <[email protected]>
BREAKING CHANGE: This PR changes the definition of grouping sets in `AggregateRel` to consist of references into a list of grouping expressions instead of consisting of expressions directly. With the previous definition, consumers had to deduplicate the expressions in the grouping sets in order to execute the query or even derive the output schema (which is problematic, as explained below). With this change, the responsibility of deduplicating expressions is now on the producer. Concretely, consumers are now expected to be simpler: The list of grouping expressions immediately provides the information needed to derive the output schema and the list of grouping sets explicitly and unambiguously provides the equality of grouping expressions. Producers now have to specify the grouping sets explicitly. If their internal representation of grouping sets consists of full grouping expressions (rather than references), then they must deduplicate these expressions according to their internal notion of expression equality in order to produce grouping sets consisting of references to these deduplicated expressions. If the previous format is desired, it can be obtained from the new format by (1) deduplicating the grouping expressions (according to the previously applicable definition of expression equality), (2) re-establishing the duplicates using the emit clause, and (3) "dereferencing" the references in the grouping sets, i.e., by replacing each reference in the grouping sets with the expression it refers to. The previous version was problematic because it required the *consumers* to deduplicate the expressions from the grouping sets. This, in turn, requires to parse and understand 100% of these expression even in cases where that understanding is otherwise optional, which is in opposition to the general philosophy of allowing for simple-minded consumers. The new version avoids that problem and, thus, allows consumers to be simpler. The issues are discussed in even more detail in #700. --------- Signed-off-by: Ingo Müller <[email protected]> Co-authored-by: Weston Pace <[email protected]>
This issue summarizes two threads in the Slack channel, which will be behind a paywall at some point.
The aggregate relation uses a definition of expression equality that affects its semantics, including its output schema. The current definition is ambiguous and based on protobuf message equality. The first is a problem because it makes the precise semantics of the relation ambiguous; the second is problematic because the protobuf definition should be derived from the spec and not the other way around.
This is the relevant part of the spec (emphasis mine):
To elaborate the problem, let's consider an example. Let's say we have the following three grouping sets, where
a
andb
are two arbitrary but different expressions:Then the set of grouping expressions will be
(a, b)
and the output schema will have a column for each each ofa
andb
.The challenge arises during the deserialization of the protobuf messages. At that point, we do not have
a
andb
yet, but four different messages:In order to get to the representation above, we need to establish that
m1 == m3
,m2 == m4
, and !(m1 == m2)
. For different definitions of==
, we may either come to the interpretation above, i.e., that the grouping set expressions is(m1, m2)
, or any other subset of(m1, m2, m3, m4)
, most of which have different semantics and, thus, break the common understanding between systems that don't agree on==
.The current formulation of the spec allows for at least the following interpretations:
MessageDifferencer
implements.There are still multiple questions with this approach:
Note that expressions may contain sub-queries so pretty much all of the spec is affected and we have to come up with a definition that works unambiguously for all of that.
For reference, #92/#100 introduced the support for arbitrary grouping expressions instead of just field references. This simplified producers because it allowed them to store the grouping expressions directly in the aggregate relation rather than having to emit a project relation with the expressions and refer to those. The spec was then updated in #258/#260 to reflect the grouping expressions and introduced the current definition of equality.
The text was updated successfully, but these errors were encountered: