[Data] optimize dataset.unique() #49296

wingkitlee0 · 2024-12-17T02:30:20Z

Why are these changes needed?

The current implementation uses groupby(column).count(), which causes a full sort. The new implementation uses an AggregateFn, which runs through groupby(None) and aggregates the unique values with set().

According to the documentation for ds.aggregate(), the time complexity should be O(N / parallelism).
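To illustrate the idea, here is a minimal pure-Python sketch of set-based aggregation following the init / accumulate / merge shape that an AggregateFn takes. It is not the code in this PR: `Block`, `init`, `accumulate_block`, `merge`, and `unique` are hypothetical names, and plain lists of dicts stand in for Ray Data blocks. Each block is scanned once, so no sort is needed.

```python
from functools import reduce
from typing import Any, Dict, Iterable, List, Set

# Hypothetical stand-in for a Ray Data block: a list of row dicts.
Block = List[Dict[str, Any]]


def init() -> Set[Any]:
    # Start every partial aggregation from an empty set.
    return set()


def accumulate_block(acc: Set[Any], block: Block, column: str) -> Set[Any]:
    # Single pass over the block; duplicates collapse inside the set.
    acc.update(row[column] for row in block)
    return acc


def merge(a: Set[Any], b: Set[Any]) -> Set[Any]:
    # Combine two partial results; set union is order-independent.
    return a | b


def unique(blocks: Iterable[Block], column: str) -> List[Any]:
    # In a distributed setting each block could be accumulated in parallel;
    # here the blocks are processed sequentially for illustration.
    partials = [accumulate_block(init(), block, column) for block in blocks]
    return list(reduce(merge, partials, init()))


blocks = [
    [{"label": "a"}, {"label": "b"}],
    [{"label": "b"}, {"label": "c"}],
]
print(sorted(unique(blocks, "label")))
```

Because the per-block work is independent and the merge is a set union, the result does not depend on block order. The returned list is unordered, which is why the example sorts it before printing.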

It's about 10x faster in my local test.

Part of test_unique was removed because it was written against the original implementation.

Related issue number

Closes #49298

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Kit Lee <[email protected]>

wingkitlee0 force-pushed the optimize-dataset-unique branch 2 times, most recently from f21dbeb to a2270a7 · December 17, 2024 03:42

richardliaw assigned raulchen Dec 17, 2024

wingkitlee0 force-pushed the optimize-dataset-unique branch 4 times, most recently from 91c0e5a to 3802155 · December 19, 2024 03:09

wingkitlee0 marked this pull request as ready for review December 19, 2024 03:10

wingkitlee0 requested a review from a team as a code owner December 19, 2024 03:10

[Data] Optimize dataset.unique()

5c5cc7f

Signed-off-by: Kit Lee <[email protected]>

wingkitlee0 force-pushed the optimize-dataset-unique branch from 2882ed5 to 5c5cc7f · December 19, 2024 03:47
