feat: implement `join` operation #32

dshaaban01 · 2024-12-07T00:43:47Z

Implemented the join operation, including return type inference, export, import, and all tests.

TODO: Add support for join expressions.

Also: I'm not completely sure I got return type inference right. The Substrait specification has a lot of stackable keywords (like right semi), each of which is a separate enum value. Are these keywords supposed to be able to "stack"? Depending on the answer, my return type inference implementation may change. I think return type inference will also change when we introduce nullability, but the way I implemented it now is similar to the cross op; I think I was over-engineering it in my head. Let me know if I'm way off.

@ingomueller-net

ingomueller-net · 2024-12-09T08:26:23Z

We need indeed all of the enum values that are also in the protobuf spec:

  enum JoinType {
    JOIN_TYPE_UNSPECIFIED = 0;
    JOIN_TYPE_INNER = 1;
    JOIN_TYPE_OUTER = 2;
    JOIN_TYPE_LEFT = 3;
    JOIN_TYPE_RIGHT = 4;
    JOIN_TYPE_LEFT_SEMI = 5;
    JOIN_TYPE_LEFT_ANTI = 6;
    JOIN_TYPE_LEFT_SINGLE = 7;
    JOIN_TYPE_RIGHT_SEMI = 8;
    JOIN_TYPE_RIGHT_ANTI = 9;
    JOIN_TYPE_RIGHT_SINGLE = 10;
    JOIN_TYPE_LEFT_MARK = 11;
    JOIN_TYPE_RIGHT_MARK = 12;
  }

Each of them is a distinct case. For example, a "left" join is really a "left outer" join (which returns all matching join partners plus all unmatched tuples from the left input), which is different from a "left anti" join (which returns all tuples from the left input that do not have a matching partner in the right input).
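
To make the distinction concrete, here is a minimal Python sketch (not part of this PR) of the two join types on lists of tuples. The function names, the `right_width` parameter, and the match-on-first-field predicate are assumptions chosen for illustration only.

```python
def left_outer_join(left, right, match, right_width):
    """All matching pairs, plus unmatched left rows padded with None."""
    out = []
    for l in left:
        partners = [r for r in right if match(l, r)]
        if partners:
            # One output row per matching partner.
            out.extend(l + r for r in partners)
        else:
            # No partner: keep the left row and fill the right side with nulls.
            out.append(l + (None,) * right_width)
    return out


def left_anti_join(left, right, match):
    """Only the left rows that have no matching partner on the right."""
    return [l for l in left if not any(match(l, r) for r in right)]


left = [(1, "a"), (2, "b")]
right = [(1, "x")]
same_key = lambda l, r: l[0] == r[0]  # hypothetical join predicate

outer = left_outer_join(left, right, same_key, right_width=2)
anti = left_anti_join(left, right, same_key)
```

Here `outer` contains both left rows (the second one padded with nulls), while `anti` contains only the unmatched row `(2, "b")` and no right-hand columns at all.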

ingomueller-net · 2024-12-09T08:27:02Z

include/substrait-mlir/Dialect/Substrait/IR/SubstraitOps.td

@@ -474,6 +474,36 @@ def Substrait_FilterOp : Substrait_RelOp<"filter", [
  }];
 }

+def JoinTypeKind : I32EnumAttr<"JoinTypeKind",


Nit: move to enums file?

We actually don't have an enums file. The SetOpKind enum is also defined in this file. Should I open a PR that creates a SubstraitEnums.td file?

Oh, OK! My long-pending PR for aggregate creates one. Let's move the other enums once that is merged?

Let's do that 👍

ingomueller-net · 2024-12-09T08:27:54Z

include/substrait-mlir/Dialect/Substrait/IR/SubstraitOps.td

+  );
+  let results = (outs Substrait_Relation:$result);
+  let assemblyFormat = [{
+    $join_type $left `j` $right attr-dict `:` type($left) `j` type($right)


Nit: I am not a fan of the j. A bowtie symbol would be nice but I think that'd be too difficult to type ;)

Fun fact: I initially tried to use the symbol >< in the assembly format, but it was rejected. Unfortunately, it is not allowed.

Pity! Nice idea, though!

ingomueller-net · 2024-12-09T09:46:39Z

test/Dialect/Substrait/join.mlir

+    %2 = join single %0 j %1 : tuple<si32> j tuple<si32> 
+    yield %2 : tuple<si32, si32>
+  }
+}


Nit: add a blank line at the end.

ingomueller-net · 2024-12-09T09:48:08Z

We need indeed all of the enum values that are also in the protobuf spec:
...

For the record: we found out that the original list is actually fine: it corresponds to the list in the protobuf version of the somewhat outdated version of the Substrait git module that we are currently using.

dshaaban01 · 2024-12-11T14:33:03Z

@ingomueller-net
Do you prefer this assembly format

$join_type join$left,$right attr-dict:type($result)

or

$join_type join$left,$right attr-dict:type($left),type($right) -> type($result)

Do we want to display the types of the left and right inputs or just show the type of the result?

ingomueller-net · 2024-12-11T19:32:25Z

Good question. The second option can be long. At the same time, not all join types have the same rule, right? And some rules are more complex than just "concatenate", right? In other words: is there non-trivial information that we would lose in the first option?

dshaaban01 · 2024-12-12T13:58:02Z

Good question. The second option can be long. At the same time, not all join types have the same rule, right? And some rules are more complex than just "concatenate", right? In other words: is there non-trivial information that we would lose in the first option?

I think that if users don't know or haven't memorized which join types drop or add certain columns, the second option is better (even though it is long). If we assume that users don't need to know this and can look up the documentation when they are confused, we can go with the first option. What is the final call?

ingomueller-net · 2024-12-12T14:07:33Z

What is the final call?

I think I'd spell out the types.

It's only one of several cases but, as one example, the outer joins will have nullable output fields, from which one cannot tell whether the corresponding input fields were already nullable. That information would be missing from the assembly if we don't spell out the input types.
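
A small Python sketch of that argument, under stated assumptions: nullability is not implemented in the dialect yet, so the `Field` type and the rule "a full outer join makes every output field nullable" are hypothetical stand-ins for illustration.

```python
from typing import NamedTuple


class Field(NamedTuple):
    type: str
    nullable: bool


def full_outer_result(left, right):
    # Assumption: either side may be padded with nulls, so every
    # output field of a full outer join is nullable.
    return tuple(Field(f.type, True) for f in left + right)


# Two different left input types ...
a = full_outer_result((Field("si32", False),), (Field("si32", False),))
b = full_outer_result((Field("si32", True),), (Field("si32", False),))

# ... produce the same result type, so the result type alone
# cannot tell us what the inputs looked like.
inputs_indistinguishable = a == b
```

Because `a == b` holds, printing only `type($result)` would drop the input nullability, which is the reason for spelling out `type($left)` and `type($right)` as well.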

dshaaban01 · 2024-12-12T15:51:20Z

FOR REFERENCE - my notes for the return type inference, could be helpful: (from substrait.io)

For a left input and right input

For a match:

Inner, Outer, Left, Right --> returns all columns
(Left) Semi --> just return columns from left input
(Left) Anti --> ignored - just return columns from left input
(Left) Single --> just return columns from right input

No match:

Inner --> ignored - all columns
(Left) Semi --> ignored - columns from right input
Outer --> return all columns along with nulls for opposite input (could be left or right)
Left --> return left columns along with nulls for right input
Right --> return right columns along with nulls for left input
(Left) Anti --> just return columns from left input
(Left) Single --> just return columns from right input (but with all nulls)

In this PR, we always assume a match. Next PR will address join expressions (so we can also have non-matches) and then we can update when nullability is implemented.
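
As a rough illustration, the "match" rules from the notes above can be sketched in Python as follows. This follows the notes literally and is not the PR's C++ implementation, which may differ; the function name and the lowercase join type strings are assumptions.

```python
def infer_join_result(join_type, left, right):
    """Result field types for a join, assuming every row finds a match.

    `left` and `right` are tuples of field type names.
    """
    if join_type in ("inner", "outer", "left", "right"):
        # All columns of both inputs.
        return left + right
    if join_type in ("semi", "anti"):
        # Only the columns of the left input.
        return left
    if join_type == "single":
        # Per the notes above: only the columns of the right input.
        return right
    raise ValueError(f"unknown join type: {join_type}")


inner_type = infer_join_result("inner", ("si32",), ("si32",))
semi_type = infer_join_result("semi", ("si32", "si1"), ("si32",))
```

Once non-matches and nullability are supported, each branch would also have to mark the padded side as nullable, which this sketch does not model.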

dshaaban01 · 2024-12-26T16:32:57Z

New assembly format for join op:

$join_type $left `,` $right attr-dict `:` type($left) `,` type($right) `->` type($result)

--> in tests it will look like this

%2 = join inner %0, %1 : tuple<si32> , tuple<si32> -> tuple<si32,si32>

Writing it out super clearly because I had lots of errors when writing the tests: I had something that looked like %2 = join inner join %0 ... based on our conversation above, which I don't think is what we want. I also tried to rearrange it so that we had %2 = inner join %0 ..., but MLIR does not accept this: the join keyword must appear before the enum.
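
For anyone updating tests by hand, here is a hedged Python sketch of the shape of the new textual form. It is a plain regular expression written for this comment, not MLIR's parser, and it only handles flat `tuple<...>` types; `JOIN_RE` and `parse_join` are hypothetical names.

```python
import re

# result = join <kind> <left>, <right> : <left type>, <right type> -> <result type>
JOIN_RE = re.compile(
    r"(?P<result>%\w+)\s*=\s*join\s+(?P<kind>\w+)\s+"
    r"(?P<left>%\w+)\s*,\s*(?P<right>%\w+)\s*:\s*"
    r"(?P<left_type>tuple<[^>]*>)\s*,\s*(?P<right_type>tuple<[^>]*>)\s*->\s*"
    r"(?P<result_type>tuple<[^>]*>)"
)


def parse_join(line):
    """Split one join line into its named parts, or raise ValueError."""
    m = JOIN_RE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"not a join op in the new format: {line!r}")
    return m.groupdict()


parsed = parse_join(
    "%2 = join inner %0, %1 : tuple<si32> , tuple<si32> -> tuple<si32,si32>"
)
```

With this pattern, the mistaken form `%2 = join inner join %0, ...` is rejected, because the token after the join kind must be an SSA value such as `%0`.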

dshaaban01 · 2024-12-26T22:37:18Z

"done" from my side until further feedback (finally! sorry it took so long)

dshaaban01 added 2 commits December 7, 2024 01:36

implement join

095af9b

cLaNg sAgA 💀

959c396

ingomueller-net reviewed Dec 9, 2024

hi

fa7646e

minor

0172bcc

dshaaban01 added 5 commits December 26, 2024 18:47

first draft new join - i have commented out my first attempt at prope…

f60e77d

…r type inference. will adjust in next commit. unable to test locally so doing this via CI 💀

small error

f77df3d

fix tests

396a9d7

TYPE INFERENCE HALLELUJAH

afc3b8e

clang tingz

ff2b1e6
