Implement faster single batch encoding/decoding for use in shuffle #1189

andygrove · 2024-12-20T16:45:11Z

What is the problem the feature request solves?

Because Spark shuffle is block-based rather than streaming, each batch has to be serialized on its own, together with its schema information. We currently use Arrow IPC for this, but it is not efficient for single batches. A crude prototype in PR TBD shows that a custom implementation of single-batch serde can perform much better.
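To make the idea concrete, here is a minimal sketch of what "a single batch plus its schema in one block" could look like. It is not the prototype's format and not Arrow IPC. The layout, the type tags, and the names `encode_batch` / `decode_batch` are all assumptions made up for illustration, and it only handles non-null `int64` and `float64` columns.

```python
import struct

# Hypothetical type tags for this sketch only; not Arrow's or Comet's encoding.
TYPE_CODES = {"int64": (0, "q"), "float64": (1, "d")}
TAG_TO_TYPE = {tag: (name, fmt) for name, (tag, fmt) in TYPE_CODES.items()}


def encode_batch(columns):
    """Encode [(name, type_name, values), ...] into one self-describing block."""
    num_rows = len(columns[0][2]) if columns else 0
    # Block header: column count and row count, little-endian u32 each.
    out = bytearray(struct.pack("<II", len(columns), num_rows))
    for name, type_name, values in columns:
        if len(values) != num_rows:
            raise ValueError("all columns must have the same length")
        tag, fmt = TYPE_CODES[type_name]
        name_bytes = name.encode("utf-8")
        # Schema entry: name length (u16), name bytes, type tag (u8).
        out += struct.pack("<H", len(name_bytes)) + name_bytes
        out += struct.pack("<B", tag)
        # Column data: fixed-width values written back to back.
        out += struct.pack(f"<{num_rows}{fmt}", *values)
    return bytes(out)


def decode_batch(block):
    """Decode a block produced by encode_batch back into column tuples."""
    num_cols, num_rows = struct.unpack_from("<II", block, 0)
    offset = 8
    columns = []
    for _ in range(num_cols):
        (name_len,) = struct.unpack_from("<H", block, offset)
        offset += 2
        name = block[offset:offset + name_len].decode("utf-8")
        offset += name_len
        (tag,) = struct.unpack_from("<B", block, offset)
        offset += 1
        type_name, fmt = TAG_TO_TYPE[tag]
        data_fmt = f"<{num_rows}{fmt}"
        values = list(struct.unpack_from(data_fmt, block, offset))
        offset += struct.calcsize(data_fmt)
        columns.append((name, type_name, values))
    return columns


if __name__ == "__main__":
    batch = [("id", "int64", [1, 2, 3]), ("score", "float64", [0.5, 1.5, -2.0])]
    block = encode_batch(batch)
    print(len(block), decode_batch(block) == batch)
```

Since the schema travels inside every block, each shuffle block can be decoded without any shared stream state. A real implementation would also need validity bitmaps, variable-length types, and nested types, none of which this sketch attempts.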

Describe the potential solution

No response

Additional context

No response

andygrove added the enhancement (New feature or request) label on Dec 20, 2024

This was referenced Dec 20, 2024

[EPIC] Improve shuffle performance #1123

Open

[do not review] feat: Implement fast serde for single record batches #1190

Draft
