docs: Add some documentation explaining how shuffle works #1148

andygrove · 2024-12-06T16:22:06Z

Which issue does this PR close?

Closes #977

Rationale for this change

I didn't fully understand how shuffle writes worked and why it involved two separate native plans. Now that I do, I am documenting it for the next person who is curious about this.

What changes are included in this PR?

More docs.

How are these changes tested?

parthchandra · 2024-12-06T17:02:07Z

Thank you for shedding light on this @andygrove !

viirya

Note that this is not invented by Comet but basically how Spark shuffle works.

* feat: support array_append (#1072) * feat: support array_append * formatted code * rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde * remove unwrap * Fix for Spark 3.3 * refactor array_append binary expression serde code * Disabled array_append test for spark 4.0+ * chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator (#1063) * docs: Update benchmarking.md (#1085) * feat: Require offHeap memory to be enabled (always use unified memory) (#1062) * Require offHeap memory * remove unused import * use off heap memory in stability tests * reorder imports * test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config (#1087) * Add changelog for 0.4.0 (#1089) * chore: Prepare for 0.5.0 development (#1090) * Update version number for build * update docs * build: Skip installation of spark-integration and fuzz testing modules (#1091) * Add hint for finding the GPG key to use when publishing to maven (#1093) * docs: Update documentation for 0.4.0 release (#1096) * update TPC-H results * update Maven links * update benchmarking guide and add TPC-DS results * include q72 * fix: Unsigned type related bugs (#1095) ## Which issue does this PR close? Closes #1067 ## Rationale for this change Bug fix. A few expressions were failing some unsigned type related tests ## What changes are included in this PR? - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()` (`>` vs `>=`) ## How are these changes tested? Put back tests for unsigned types * chore: Include first ScanExec batch in metrics (#1105) * include first batch in ScanExec metrics * record row count metric * fix regression * chore: Improve CometScan metrics (#1100) * Add native metrics for plan creation * make messages consistent * Include get_next_batch cost in metrics * formatting * fix double count of rows * chore: Add custom metric for native shuffle fetching batches from JVM (#1108) * feat: support array_insert (#1073) * Part of the implementation of array_insert * Missing methods * Working version * Reformat code * Fix code-style * Add comments about spark's implementation. * Implement negative indices + fix tests for spark < 3.4 * Fix code-style * Fix scalastyle * Fix tests for spark < 3.4 * Fixes & tests - added test for the negative index - added test for the legacy spark mode * Use assume(isSpark34Plus) in tests * Test else-branch & improve coverage * Update native/spark-expr/src/list.rs Co-authored-by: Andy Grove <[email protected]> * Fix fallback test In one case there is a zero in index and test fails due to spark error * Adjust the behaviour for the NULL case to Spark * Move the logic of type checking to the method * Fix code-style --------- Co-authored-by: Andy Grove <[email protected]> * feat: enable decimal to decimal cast of different precision and scale (#1086) * enable decimal to decimal cast of different precision and scale * add more test cases for negative scale and higher precision * add check for compatibility for decimal to decimal * fix code style * Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala Co-authored-by: Andy Grove <[email protected]> * fix the nit in comment --------- Co-authored-by: himadripal <[email protected]> Co-authored-by: Andy Grove <[email protected]> * docs: fix readme FGPA/FPGA typo (#1117) * fix: Use RDD partition index (#1112) * fix: Use RDD partition index * fix * fix * fix * fix: Various metrics bug fixes and improvements (#1111) * fix: Don't create CometScanExec for subclasses of ParquetFileFormat (#1129) * Use exact class comparison for parquet scan * Add test * Add comment * fix: Fix metrics regressions (#1132) * fix metrics issues * clippy * update tests * docs: Add more technical detail and new diagram to Comet plugin overview (#1119) * Add more technical detail and new diagram to Comet plugin overview * update diagram * add info on Arrow IPC * update diagram * update diagram * update docs * address feedback * Stop passing Java config map into native createPlan (#1101) * feat: Improve ScanExec native metrics (#1133) * save * remove shuffle jvm metric and update tuning guide * docs * add source for all ScanExecs * address feedback * address feedback * chore: Remove unused StringView struct (#1143) * Remove unused StringView struct * remove more dead code * docs: Add some documentation explaining how shuffle works (#1148) * add some notes on shuffle * reads * improve docs * test: enable more Spark 4.0 tests (#1145) ## Which issue does this PR close? Part of #372 and #551 ## Rationale for this change To be ready for Spark 4.0 ## What changes are included in this PR? This PR enables more Spark 4.0 tests that were fixed by recent changes ## How are these changes tested? tests enabled * chore: Refactor cast to use SparkCastOptions param (#1146) * Refactor cast to use SparkCastOptions param * update tests * update benches * update benches * update benches * Enable more scenarios in CometExecBenchmark. (#1151) * chore: Move more expressions from core crate to spark-expr crate (#1152) * move aggregate expressions to spark-expr crate * move more expressions * move benchmark * normalize_nan * bitwise not * comet scalar funcs * update bench imports * remove dead code (#1155) * fix: Spark 4.0-preview1 SPARK-47120 (#1156) ## Which issue does this PR close? Part of #372 and #551 ## Rationale for this change To be ready for Spark 4.0 ## What changes are included in this PR? This PR fixes the new test SPARK-47120 added in Spark 4.0 ## How are these changes tested? tests enabled * chore: Move string kernels and expressions to spark-expr crate (#1164) * Move string kernels and expressions to spark-expr crate * remove unused hash kernel * remove unused dependencies * chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165) * move CheckOverflow to spark-expr crate * move NegativeExpr to spark-expr crate * move UnboundColumn to spark-expr crate * move ExpandExec from execution::datafusion::operators to execution::operators * refactoring to remove datafusion subpackage * update imports in benches * fix * fix * chore: Add ignored tests for reading complex types from Parquet (#1167) * Add ignored tests for reading structs from Parquet * add basic map test * add tests for Map and Array * feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169) * Add Spark-compatible SchemaAdapterFactory implementation * remove prototype code * fix * refactor * implement more cast logic * implement more cast logic * add basic test * improve test * cleanup * fmt * add support for casting unsigned int to signed int * clippy * address feedback * fix test * fix: Document enabling comet explain plan usage in Spark (4.0) (#1176) * test: enabling Spark tests with offHeap requirement (#1177) ## Which issue does this PR close? ## Rationale for this change After #1062 We have not running Spark tests for native execution ## What changes are included in this PR? Removed the off heap requirement for testing ## How are these changes tested? Bringing back Spark tests for native execution * feat: Improve shuffle metrics (second attempt) (#1175) * improve shuffle metrics * docs * more metrics * refactor * address feedback * Fix redundancy in Cargo.lock. * Format, more post-merge cleanup. * Compiles * Compiles * Remove empty file. * Attempt to fix JNI issue and native test build issues. * Test Fix * Update planner.rs Remove println from test. --------- Co-authored-by: NoeB <[email protected]> Co-authored-by: Liang-Chi Hsieh <[email protected]> Co-authored-by: Raz Luvaton <[email protected]> Co-authored-by: Andy Grove <[email protected]> Co-authored-by: Parth Chandra <[email protected]> Co-authored-by: KAZUYUKI TANIMURA <[email protected]> Co-authored-by: Sem <[email protected]> Co-authored-by: Himadri Pal <[email protected]> Co-authored-by: himadripal <[email protected]> Co-authored-by: gstvg <[email protected]> Co-authored-by: Adam Binford <[email protected]>

add some notes on shuffle

b298f76

andygrove marked this pull request as draft December 6, 2024 16:45

reads

e39b3b7

parthchandra approved these changes Dec 6, 2024

View reviewed changes

viirya approved these changes Dec 6, 2024

View reviewed changes

improve docs

27d55d4

andygrove marked this pull request as ready for review December 6, 2024 20:04

andygrove merged commit b95dc1d into apache:main Dec 6, 2024

andygrove deleted the shuffle-notes branch December 6, 2024 20:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: Add some documentation explaining how shuffle works #1148

docs: Add some documentation explaining how shuffle works #1148

andygrove commented Dec 6, 2024

parthchandra commented Dec 6, 2024

viirya left a comment

docs: Add some documentation explaining how shuffle works #1148

docs: Add some documentation explaining how shuffle works #1148

Conversation

andygrove commented Dec 6, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

parthchandra commented Dec 6, 2024

viirya left a comment

Choose a reason for hiding this comment