-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add native shuffle and columnar shuffle #30
Conversation
The CI failure is:
It will be fixed at #32 . |
b5956ae
to
43ddbdb
Compare
cc @sunchao |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good. The code has already been reviewed internally.
spark/src/main/scala/org/apache/spark/sql/comet/CometTakeOrderedAndProjectExec.scala
Outdated
Show resolved
Hide resolved
spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala
Outdated
Show resolved
Hide resolved
spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala
Outdated
Show resolved
Hide resolved
Let's also put some details in the PR description. |
Added some description there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Merged. Thanks. |
* build: Fix references to old Boson build * build: Only publish Comet for Spark 3.4
Which issue does this PR close?
Closes #29.
Rationale for this change
As a columnar execution engine plugin to Spark, Comet comes out columnar operation support by providing corresponding operators to replace row-based operators in Spark. Shuffle is also a row-based operation in Spark. It is happened between the boundary of SQL operators which need to exchange data according to some specified distribution requirements. Without columnar shuffle, it means we need to do columnar to row/row to columnar around each shuffle operations. Thus, we propose Comet shuffle operators in this patch.
What changes are included in this PR?
Two kind of shuffle operators are included in this patch: native shuffle and columnar shuffle. Both shuffle operators are columnar-based operations and use same native implementation to write shuffle data into disk. Native shuffle takes columnar batches output from Comet operators directly. Columnar shuffle takes row outputs from downstream operators which could be Spark operators or Comet operators wrapped by ColumnarToRow operator. These rows are converted into columnar batches in the native writer and written into disk.
How are these changes tested?