-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FR: Add plan_anchor field to PlanRel #725
Comments
This sounds cool and potentially related to something I'm working on.
Do you mean merging plans is straightforward with respect to not having to compare function semantics?
How would you define the same type of anchor for a PlanRel? It seems easy to think that a function extension URI maps 1-1 with a particular semantic definition of a function. I'm not sure how to map that to PlanRels |
In my own project, I use Discussion of splitting and merging aside, the If that sounds similar to what you're doing, then I think this is a great idea. I can explain the details of how I do it and find some way to upstream it. If you're trying to do something more general, then I'd be interested in hearing more about your use case. |
What I mean by a merge is to use two potentially independent Rels in the same parent Rel (for example a JoinRel or a SetRel). Normally a query like the following:
Note that the plan has two entries in the list of relations, the first being a plain Rel (representing the contents of a CTE) and another RelRoot that uses ReferenceRel{ subtree_ordinal = 0 } to refer to the first entry in the list. The problem arises when you want to join two subtrees that already independently contain CTEs:
Each plan already uses a ReferenceRel{ subtree_ordinal = 0 } but actually pointing to different CTEs, the merged version of the plan can obviously only put one of them in the 0th position of the The alternative would be to have a |
Unless I'm misunderstanding, I think you should be able to accomplish this w/o introducing PlanAnchor message type (even w/o changes requested in this ticket). Instead of a PlanAnchor you can put the Rel in the list of relations and refer to it by ReferenceRel instead of a PlanAnchor. |
I think this would only work if I still keep all Rels in the plan. I was actually reducing the substrait message itself to only contain the Rels to be executed. Maybe I can demote the PlanRel.RelRoot to a normal Rel and promote the desired Rel to a new PlanRel.RelRoot and mark it so that a consumer doesn't think it's a CTE then that could work? But, that also seems wasteful in terms of payload size and complexity. I can think on it though |
but how would you do this in the general case? I guess you could maybe make a hash that accommodates operators and schemas (essentially an identifier of a subgraph in a plan), but even if you do that, wouldn't you have to verify there were no collisions? |
For a general case, you can't really do anything, there can always be a "collision". (the same applies for function references as well) For a specific scenario when all the plans used are being incrementally built in the same process (converting from other representation or building with a dataframe API) you can simply set up a system-wide unique integer generator. That's how we handle function references in ibis-substrait producer now for example. |
Also, this could potentially be addressed by keeping a list of |
yeah, so I get what the idea is and I wonder if it should just be called something other than an anchor since it's a general identity property that you can use an extrinsic approach for (UUID generator) or an intrinsic approach for (subgraph identity as a hash). I am interested in supporting materialized views so it would be valuable for me to reuse this type of property for an intrinsic identity value |
So I can see how referencing by id instead of index might make merging easier. It might be useful to look at what other operations are happening when you're merging plans for context. For instance, when merging happens you also have to reorganize function references and function URIs (which is even more involved as you have to walk all the expressions in the merged tree instead of the relations). |
@EpsilonPrime to be exact, even for plan references, you need to walk all the expressions in all the merged trees as well, because you might have to update subtree_ordinal fields in Let me give you a basic rundown of what's happening during a merge wrt function references. There are 2 general scenarios:
For the first scenario, there are no easy escape hatches, you have to at some point (probably right before merge) check anchors (inside For the second scenario, the easy solution is to set up a single central registry which will be handing out unique references to functions to all the plans being created even if they don't have anything to do with one another. This is to guarantee that in case you need a merge, there will be no collisions and therefore no need to walk all the subtrees. You simply need to merge extensions and extension_uris and call it a day. This is roughly what's happening both in The point of this ticket is that the solution in the second scenario is possible only for |
This adds an "anchor" to `PlanRel` that can be referenced from a `ReferenceRel`. This anchor and reference relationship provides a non-ordinal method for identifying and accessing a "subtree" or sub-graph. This commit leaves in the original `subtree_ordinal` attribute since it seems a (mildly) more performant method for referencing a subtree, but also since it is still relevant in the typical case. The new anchor improves cases where multiple plans are merged and at least one already contains a `ReferenceRel`. It is expected that only one of `subtree_ordinal` or `subtree_reference` will be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in a `oneof` constraint. Issue: substrait-io#725
I created #726 so I could get a sense of what this might look like and how it affects my custom |
The name "PlanAnchor" collides with the concept of "anchor" used in substrait. Renaming of the message is to avoid conflict with work on anchors for non-ordinal plan references. Related: substrait-io/substrait#725
I'm working on a library that lets you build up substrait plans using dataframe API. Over the course of dataframe transformations various unrelated plans need to be merged to construct the final substrait plan. For the most part it's pretty straightforward, but you need to deal with extension functions and references during the merges.
Functions in substrait fortunately let you define "pointers" in a relaxed fashion with anchors. This means that as long as there is some global way of assigning a unique identifier to each function beforehand, merging plans become straightforward. Unlike functions, relationship between PlanRels and RefereceRels is modeled differently, RefereceRels need to be aware of the ordinal position of the target rel in the relations array. This makes merging plans very tricky. One has to either traverse the whole plan to adjust ordinal positions on each merge or use some sort of placeholders and do a single-pass traversal at the end.
To make this simpler, substrait could use the same anchor mechanism with PlanRels. An additional
plan_anchor
field can be introduced in PlanRel message, which will be used by ReferenceRel instead of relying on ordinal positions.The text was updated successfully, but these errors were encountered: