Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add native shuffle and columnar shuffle #30

Merged
merged 2 commits into from
Feb 16, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Feb 15, 2024

Which issue does this PR close?

Closes #29.

Rationale for this change

As a columnar execution engine plugin to Spark, Comet comes out columnar operation support by providing corresponding operators to replace row-based operators in Spark. Shuffle is also a row-based operation in Spark. It is happened between the boundary of SQL operators which need to exchange data according to some specified distribution requirements. Without columnar shuffle, it means we need to do columnar to row/row to columnar around each shuffle operations. Thus, we propose Comet shuffle operators in this patch.

What changes are included in this PR?

Two kind of shuffle operators are included in this patch: native shuffle and columnar shuffle. Both shuffle operators are columnar-based operations and use same native implementation to write shuffle data into disk. Native shuffle takes columnar batches output from Comet operators directly. Columnar shuffle takes row outputs from downstream operators which could be Spark operators or Comet operators wrapped by ColumnarToRow operator. These rows are converted into columnar batches in the native writer and written into disk.

How are these changes tested?

@viirya
Copy link
Member Author

viirya commented Feb 15, 2024

The CI failure is:

Files with unapproved licenses:
  .github/pull_request_template.md

It will be fixed at #32 .

@viirya
Copy link
Member Author

viirya commented Feb 16, 2024

cc @sunchao

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good. The code has already been reviewed internally.

@sunchao
Copy link
Member

sunchao commented Feb 16, 2024

Let's also put some details in the PR description.

@viirya
Copy link
Member Author

viirya commented Feb 16, 2024

Added some description there.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@viirya viirya merged commit c5aee56 into apache:main Feb 16, 2024
2 checks passed
@viirya
Copy link
Member Author

viirya commented Feb 16, 2024

Merged. Thanks.

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* build: Fix references to old Boson build

* build: Only publish Comet for Spark 3.4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Shuffle support
2 participants