Random access array take using P2P #8774

Open
fjetter opened this issue Jul 17, 2024 · 2 comments
Labels
feature Something is missing shuffle

Comments

@fjetter
Member

fjetter commented Jul 17, 2024

Slicing in dask array effectively generates a task per contiguous subslice per chunk.

In the worst case, random indexing, this generates a slice (and therefore a task) for every index along the indexed dimension. Dask currently raises a PerformanceWarning when it detects this situation; see https://github.com/dask/dask/blob/b4b33caed8fc9cf77c9332442ab11cf00f90bb42/dask/array/slicing.py#L630-L641
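To make the task blow-up concrete, here is a rough sketch of how the number of pieces grows with an unsorted index. It is a simplified model and not dask's actual slicing code: it assumes a new piece starts whenever two consecutive index entries land in different chunks, and the name `count_pieces` is hypothetical.

```python
from itertools import groupby


def count_pieces(index, chunk_size):
    """Count runs of consecutive index entries that fall in the same chunk.

    Simplified model (an assumption, not dask's implementation): every
    such run becomes one slice, and therefore one task.
    """
    chunk_ids = [i // chunk_size for i in index]
    # groupby collapses each run of equal chunk ids into a single group
    return sum(1 for _ in groupby(chunk_ids))


# Ordered access touches each of the two chunks exactly once.
ordered = count_pieces(list(range(20)), chunk_size=10)

# An index that alternates between chunks yields one piece per entry.
alternating = count_pieces([0, 10, 1, 11, 2, 12], chunk_size=10)
```

Under this model the ordered index needs only as many pieces as there are chunks, while the alternating index needs one piece per requested entry.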

Worst case example

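import dask.array as da
import numpy as np

# Two chunks of 10 columns each along axis 1
x = da.random.random((10, 20), chunks=(10, 10))
# Random column indices, drawn with replacement and unsorted
idx = np.random.randint(0, x.shape[1], x.shape[1])

# Fancy indexing along axis 1 with the random index
y = x[:, idx]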

This random access pattern is another form of shuffle, and we should be able to offer an efficient solution for it using our P2P infrastructure.
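A minimal sketch of why a take can be treated as a shuffle, using plain Python lists in place of array chunks. It does not use the distributed P2P API; `plan_take_shuffle` and `apply_plan` are hypothetical names, and the one-dimensional layout is an assumption made for brevity. Each requested element is routed from the input chunk that holds it to the output chunk that needs it, so the number of transfers is bounded by (input chunks × output chunks) rather than by the length of the index.

```python
from collections import defaultdict


def plan_take_shuffle(index, in_chunk_size, out_chunk_size):
    """Group the moves of a take by (source chunk, destination chunk)."""
    plan = defaultdict(list)
    for out_pos, src_pos in enumerate(index):
        src_chunk, src_off = divmod(src_pos, in_chunk_size)
        dst_chunk, dst_off = divmod(out_pos, out_chunk_size)
        plan[(src_chunk, dst_chunk)].append((src_off, dst_off))
    return dict(plan)


def apply_plan(chunks, plan, n_out, out_chunk_size):
    """Execute the plan locally; each plan entry stands for one transfer."""
    out_chunks = [
        [None] * min(out_chunk_size, n_out - start)
        for start in range(0, n_out, out_chunk_size)
    ]
    for (src, dst), moves in plan.items():
        for src_off, dst_off in moves:
            out_chunks[dst][dst_off] = chunks[src][src_off]
    return out_chunks


data = list(range(100, 120))
chunks = [data[:10], data[10:]]      # two input chunks of 10 elements
index = [19, 0, 10, 5, 5, 12]        # unsorted, with a repeated entry

plan = plan_take_shuffle(index, in_chunk_size=10, out_chunk_size=4)
out = apply_plan(chunks, plan, n_out=len(index), out_chunk_size=4)
flat = [value for chunk in out for value in chunk]
```

Here `flat` matches `[data[i] for i in index]`, and `plan` holds at most four entries (two input chunks times two output chunks) no matter how scrambled the index is.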

see also pydata/xarray#9220

@fjetter added the feature and shuffle labels and removed the needs triage label on Jul 17, 2024
@fjetter
Member Author

fjetter commented Jul 17, 2024

It might be necessary, or at least helpful, to deal with dask/dask#11234 first.

@fjetter
Member Author

fjetter commented Jul 17, 2024

As a first step, I would like to understand how much of the P2P rechunk logic can be reused.

@hendrikmakait self-assigned this on Jul 18, 2024
@hendrikmakait removed their assignment on Jul 31, 2024