feat: Support Binary in shuffle writer #106

Merged: 3 commits, Feb 25, 2024

Conversation

advancedxy
Contributor

@advancedxy advancedxy commented Feb 24, 2024

Which issue does this PR close?

Closes #105

Rationale for this change

Bug fix: the native shuffle writer does not support the Binary data type (see #105).

What changes are included in this PR?

Add match arms for DataType::Binary and DataType::LargeBinary in shuffle_writer.rs

How are these changes tested?

Added new test cases on both the Rust side and the Spark side

@advancedxy
Contributor Author

I'm not sure why this issue didn't show up before... It seems binary data is already used in various shuffle suites.

Comment on lines +318 to +320
// TODO: this is not accurate, but should be good enough for now
DataType::Binary => len * 100 + len * 4,
DataType::LargeBinary => len * 100 + len * 8,
Member

Oh, I missed the binary type.
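
For readers following the review comment above, here is a minimal sketch of the estimate in the quoted diff lines. It is not the code from shuffle_writer.rs: `BinaryKind`, `ASSUMED_AVG_VALUE_BYTES`, and `estimate_binary_bytes` are hypothetical names introduced for illustration. The 100 bytes per value is the rough guess the TODO calls "not accurate", and the 4 and 8 are read here as the widths of the 32-bit and 64-bit offsets that Arrow uses for Binary and LargeBinary.

```rust
/// Hypothetical stand-in for the two Arrow binary types named in the diff.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BinaryKind {
    /// Arrow Binary: variable-length values addressed by 32-bit offsets.
    Binary,
    /// Arrow LargeBinary: variable-length values addressed by 64-bit offsets.
    LargeBinary,
}

/// Assumed average payload per value, mirroring the `len * 100` term in the diff.
const ASSUMED_AVG_VALUE_BYTES: usize = 100;

/// Rough byte estimate for `len` values: a payload guess plus one offset per value.
fn estimate_binary_bytes(kind: BinaryKind, len: usize) -> usize {
    let offset_width = match kind {
        BinaryKind::Binary => 4,
        BinaryKind::LargeBinary => 8,
    };
    len * ASSUMED_AVG_VALUE_BYTES + len * offset_width
}
```

Under these assumptions, 10 Binary values are estimated at 10 * 100 + 10 * 4 = 1040 bytes, and 10 LargeBinary values at 1080 bytes. The estimate ignores the real payload size, which is why the TODO flags it as approximate.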

@viirya
Member

viirya commented Feb 24, 2024

I'm not sure why this issue didn't show up before... It seems binary data is already used in various shuffle suites.

They use columnar shuffle, I think, which has much more coverage.

@advancedxy
Contributor Author

They use columnar shuffle, I think, which has much more coverage.

I see, thanks. Let me try to add a test case on the Spark side that leverages native shuffle.

@advancedxy
Contributor Author

Let me try to add a test case on the Spark side that leverages native shuffle.

Fixed. Please let me know whether the test code should go in shuffle_writer.rs or in a separate file.

@viirya
Member

viirya commented Feb 25, 2024

The TPCDS CI pipeline failure will be fixed in #108.

@advancedxy advancedxy closed this Feb 25, 2024
@advancedxy advancedxy reopened this Feb 25, 2024
Member

@sunchao sunchao left a comment

LGTM

@sunchao sunchao merged commit 749731b into apache:main Feb 25, 2024
29 of 30 checks passed
@sunchao
Member

sunchao commented Feb 25, 2024

Merged, thanks!

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
Force using the provided Comet version by purging previous snapshots from the local repository.  This avoids inadvertently picking up the wrong snapshot.  This only occurs if `dev/install-comet-spark.sh` is explicitly provided a Comet version to use.
Development

Successfully merging this pull request may close these issues.

data type Binary not supported in shuffle write
3 participants