feat: Add native shuffle and columnar shuffle #30

viirya · 2024-02-15T06:49:14Z

Which issue does this PR close?

Closes #29.

Rationale for this change

As a columnar execution engine plugin for Spark, Comet supports columnar operations by providing operators that replace Spark's row-based operators. Shuffle is also a row-based operation in Spark. It happens at the boundary between SQL operators that need to exchange data according to specified distribution requirements. Without columnar shuffle, we need to convert columnar to row and row to columnar around each shuffle operation. Thus, we propose Comet shuffle operators in this patch.

What changes are included in this PR?

Two kinds of shuffle operators are included in this patch: native shuffle and columnar shuffle. Both are columnar-based operations and use the same native implementation to write shuffle data to disk. Native shuffle takes columnar batches output from Comet operators directly. Columnar shuffle takes row outputs from downstream operators, which could be Spark operators or Comet operators wrapped by a ColumnarToRow operator. These rows are converted into columnar batches in the native writer and written to disk.
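To make the difference between the two input paths concrete, here is a minimal toy sketch in Python. It is not Comet's implementation: the batch representation (a dict of column lists), the modulo-based partitioning, and every function name below are assumptions made purely for illustration. It only shows that both paths can feed the same partitioning writer, with columnar shuffle adding a row-to-columnar conversion step first.

```python
from typing import Dict, List, Tuple

# Toy columnar batch: column name -> list of values (hypothetical representation).
Batch = Dict[str, list]


def rows_to_batch(rows: List[Tuple], columns: List[str]) -> Batch:
    """Pivot row tuples into one value list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def partition_batch(batch: Batch, key: str, num_partitions: int) -> List[Batch]:
    """Split a batch into per-partition batches using key modulo num_partitions.

    Stands in for the shared writer; a real writer would hash the key
    and write each partition to disk instead of returning it.
    """
    parts: List[Batch] = [{name: [] for name in batch} for _ in range(num_partitions)]
    for i, k in enumerate(batch[key]):
        target = parts[k % num_partitions]
        for name, values in batch.items():
            target[name].append(values[i])
    return parts


def native_shuffle_write(batch: Batch, key: str, num_partitions: int) -> List[Batch]:
    """Native-shuffle-style path: the input is already columnar."""
    return partition_batch(batch, key, num_partitions)


def columnar_shuffle_write(
    rows: List[Tuple], columns: List[str], key: str, num_partitions: int
) -> List[Batch]:
    """Columnar-shuffle-style path: rows are converted to a batch, then partitioned."""
    return partition_batch(rows_to_batch(rows, columns), key, num_partitions)
```

Given the same data as a batch or as rows, both functions return identical partitions, which mirrors the point that the two operators differ in their input format rather than in how shuffle data is written.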

How are these changes tested?

viirya · 2024-02-15T06:56:23Z

The CI failure is:

Files with unapproved licenses:
  .github/pull_request_template.md

It will be fixed in #32.

viirya · 2024-02-16T16:20:41Z

cc @sunchao

sunchao

Looks mostly good. The code has already been reviewed internally.

common/src/main/scala/org/apache/comet/CometConf.scala

spark/src/main/scala/org/apache/spark/sql/comet/CometTakeOrderedAndProjectExec.scala

spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala

sunchao · 2024-02-16T17:30:19Z

Let's also put some details in the PR description.

viirya · 2024-02-16T18:14:31Z

Added some description there.

sunchao

LGTM

viirya · 2024-02-16T19:26:58Z

Merged. Thanks.

feat: Add native shuffle and columnar shuffle

43ddbdb

viirya force-pushed the native_shuffle branch from b5956ae to 43ddbdb Compare February 15, 2024 16:42

sunchao reviewed Feb 16, 2024

For review

bd6ff61

sunchao approved these changes Feb 16, 2024

viirya merged commit c5aee56 into apache:main Feb 16, 2024
2 checks passed

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024

build: Fixes for the next release build (apache#30)

f4478ee

* build: Fix references to old Boson build * build: Only publish Comet for Spark 3.4
