Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement join operation #32

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

dshaaban01
Copy link
Contributor

implemented the join operation, return type inference, export, import and all tests.

TODO: Add support for join expressions.

Also: I'm not completely sure I got return type inference correctly. In the Substrait specification, they had a lot of stackable keywords (like right semi - each of which is a separate enum). Are these keywords supposed to be able to "stack"? Based on this answer, my interpretation on the return type inference implementation may change. I think return type inference will also change when we introduce nullability, but the way I implemented it now is similar to the cross op - I think I was over-engineering it in my head. Let me know if I'm way off.

@ingomueller-net

@ingomueller-net
Copy link
Collaborator

We need indeed all of the enum values that are also in the protobuf spec:

  enum JoinType {
    JOIN_TYPE_UNSPECIFIED = 0;
    JOIN_TYPE_INNER = 1;
    JOIN_TYPE_OUTER = 2;
    JOIN_TYPE_LEFT = 3;
    JOIN_TYPE_RIGHT = 4;
    JOIN_TYPE_LEFT_SEMI = 5;
    JOIN_TYPE_LEFT_ANTI = 6;
    JOIN_TYPE_LEFT_SINGLE = 7;
    JOIN_TYPE_RIGHT_SEMI = 8;
    JOIN_TYPE_RIGHT_ANTI = 9;
    JOIN_TYPE_RIGHT_SINGLE = 10;
    JOIN_TYPE_LEFT_MARK = 11;
    JOIN_TYPE_RIGHT_MARK = 12;
  }

Each of them is a distinct case. For example, a "left" join is really a "left outer" join (which returns all matching joins partners plus all unmatched tuples from the left input), which is different from a "left anti" join (which returns all tuples from the left input that do not have a matching partner in the right input).

@@ -474,6 +474,36 @@ def Substrait_FilterOp : Substrait_RelOp<"filter", [
}];
}

def JoinTypeKind : I32EnumAttr<"JoinTypeKind",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move to enums file?

Copy link
Contributor Author

@dshaaban01 dshaaban01 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually don't have an enums file. the SetOpKind Enum is also defined in this file. Should I do a PR that creates a SubstraitEnums.td file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, OK! My long-pending PR for aggregate creates one. Let's move the other enums once that is merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets do that 👍

);
let results = (outs Substrait_Relation:$result);
let assemblyFormat = [{
$join_type $left `j` $right attr-dict `:` type($left) `j` type($right)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I am not a fan of the j. A bowtie symbol would be nice but I think that'd be too difficult to type ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun fact: I actually initially tried to do this symbol in the assembly format >< but I was rejected! It was unfortunately not allowed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pitty! Nice idea, though!

%2 = join single %0 j %1 : tuple<si32> j tuple<si32>
yield %2 : tuple<si32, si32>
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: add a blank line at the end.

@ingomueller-net
Copy link
Collaborator

We need indeed all of the enum values that are also in the protobuf spec:
...

For the record: we found out that the original list is actually fine: it corresponds to the list in the protobuf version of the somewhat outdated version of the Substrait git module that we are currently using.

@dshaaban01
Copy link
Contributor Author

dshaaban01 commented Dec 11, 2024

@ingomueller-net
Do you prefer this assembly code

$join_type join$left,$right attr-dict:type($result)

or

$join_type join$left,$right attr-dict:type($left),type($right) -> type($result)

Do we want to display the types of the left and right inputs or just show the type of the result?

@ingomueller-net
Copy link
Collaborator

Good question. The second option can be long. At the same time, not all join types have the same rule, right? And some rules are more complex than just "concatenate", right? In other words: is there non-trivial information that we would loose in the first option?

@dshaaban01
Copy link
Contributor Author

Good question. The second option can be long. At the same time, not all join types have the same rule, right? And some rules are more complex than just "concatenate", right? In other words: is there non-trivial information that we would loose in the first option?

I mean I think if users don't know/ haven't memorized which of the join types drop/add certain columns, then the second option is better (even though it is long). If we feel that we should assume that users don't need to know this and can look up the documentation if they are confused, I think we can go with the first option. What is the final call?

@ingomueller-net
Copy link
Collaborator

What is the final call?

I think I'd spell out the types.

It's only one of several cases but, as one example, the outer joins will have nullable output fields that do not allow to conclude whether corresponding the input were already nullable or not, so that'd be information that wouldn't be present in the assembly if we don't spell out the input types.

@dshaaban01
Copy link
Contributor Author

dshaaban01 commented Dec 12, 2024

FOR REFERENCE - my notes for the return type inference, could be helpful: (from substrait.io)

For a left input and right input

For a match:

  • Inner, Outer, Left, Right --> returns all columns
  • (Left) Semi --> just return columns from left input
  • (Left) Anti --> ignored - just return columns from left input
  • (Left) Single --> just return columns from right input

No match:

  • Inner --> ignored - all columns
  • (Left) Semi --> ignored -columns from right input
  • Outer --> return all columns along with nulls for opposite input (could be left or right)
  • Left --> return left columns along with nulls for right input
  • Right --> return right columns along with nulls for left input
  • (Left) Anti --> just return columns from left input
  • (Left) Single --> just return columns from right input (but with all nulls)

In this PR, we always assume a match. Next PR will address join expressions (so we can also have non-matches) and then we can update when nullability is implemented.

@dshaaban01
Copy link
Contributor Author

dshaaban01 commented Dec 26, 2024

New assembly format for join op:

$join_type $left ',' $right attr-dict ':' type($left) ',' type($right) '->' type($result)

--> in tests it will look like this

%2 = join inner %0, %1 : tuple<si32> , tuple<si32> -> tuple<si32,si32>

Writing it out super clearly as I had lots of errors when writing the tests as I had smth that looked liked %2 = join inner join %0 ... based on our above conversation which is what I don't think we want. Also I tried to manipulate it such that we had %2 = inner join %0 ... but this is not accepted by mlir, join keyword must appear before enum.

@dshaaban01
Copy link
Contributor Author

"done" from my side until further feedback (finally! sorry it took so long)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants