-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement join
operation
#32
base: main
Are you sure you want to change the base?
Conversation
We need indeed all of the enum values that are also in the protobuf spec:
Each of them is a distinct case. For example, a "left" join is really a "left outer" join (which returns all matching joins partners plus all unmatched tuples from the left input), which is different from a "left anti" join (which returns all tuples from the left input that do not have a matching partner in the right input). |
@@ -474,6 +474,36 @@ def Substrait_FilterOp : Substrait_RelOp<"filter", [ | |||
}]; | |||
} | |||
|
|||
def JoinTypeKind : I32EnumAttr<"JoinTypeKind", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: move to enums file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually don't have an enums file. the SetOpKind Enum is also defined in this file. Should I do a PR that creates a SubstraitEnums.td file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, OK! My long-pending PR for aggregate
creates one. Let's move the other enums once that is merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets do that 👍
); | ||
let results = (outs Substrait_Relation:$result); | ||
let assemblyFormat = [{ | ||
$join_type $left `j` $right attr-dict `:` type($left) `j` type($right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I am not a fan of the j
. A bowtie symbol would be nice but I think that'd be too difficult to type ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fun fact: I actually initially tried to do this symbol in the assembly format ><
but I was rejected! It was unfortunately not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pitty! Nice idea, though!
test/Dialect/Substrait/join.mlir
Outdated
%2 = join single %0 j %1 : tuple<si32> j tuple<si32> | ||
yield %2 : tuple<si32, si32> | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add a blank line at the end.
For the record: we found out that the original list is actually fine: it corresponds to the list in the protobuf version of the somewhat outdated version of the Substrait git module that we are currently using. |
@ingomueller-net
or
Do we want to display the types of the left and right inputs or just show the type of the result? |
Good question. The second option can be long. At the same time, not all join types have the same rule, right? And some rules are more complex than just "concatenate", right? In other words: is there non-trivial information that we would loose in the first option? |
I mean I think if users don't know/ haven't memorized which of the join types drop/add certain columns, then the second option is better (even though it is long). If we feel that we should assume that users don't need to know this and can look up the documentation if they are confused, I think we can go with the first option. What is the final call? |
I think I'd spell out the types. It's only one of several cases but, as one example, the outer joins will have nullable output fields that do not allow to conclude whether corresponding the input were already nullable or not, so that'd be information that wouldn't be present in the assembly if we don't spell out the input types. |
FOR REFERENCE - my notes for the return type inference, could be helpful: (from substrait.io) For a left input and right input For a match:
No match:
In this PR, we always assume a match. Next PR will address join expressions (so we can also have non-matches) and then we can update when nullability is implemented. |
New assembly format for join op:
--> in tests it will look like this
Writing it out super clearly as I had lots of errors when writing the tests as I had smth that looked liked |
…r type inference. will adjust in next commit. unable to test locally so doing this via CI 💀
"done" from my side until further feedback (finally! sorry it took so long) |
implemented the join operation, return type inference, export, import and all tests.
TODO: Add support for join expressions.
Also: I'm not completely sure I got return type inference correctly. In the Substrait specification, they had a lot of stackable keywords (like
right semi
- each of which is a separate enum). Are these keywords supposed to be able to "stack"? Based on this answer, my interpretation on the return type inference implementation may change. I think return type inference will also change when we introduce nullability, but the way I implemented it now is similar to the cross op - I think I was over-engineering it in my head. Let me know if I'm way off.@ingomueller-net