There is currently no way to distribute GPUs among fireworks when running small jobs in parallel on one system.
An example: on NERSC, you get exclusive access to one Perlmutter node with 4 A100 GPUs. If you were to run 4 fireworks that each require 1 GPU, using `rlaunch multi 4`, each firework would be responsible for determining which GPUs to run on. Most Python code defaults to checking `CUDA_VISIBLE_DEVICES` and taking either the first GPU or all of them, which oversubscribes the devices and leads to poor performance or an error.

I don't believe this implementation would work for systems with non-NVIDIA/CUDA GPUs. I believe AMD devices require setting the `HIP_VISIBLE_DEVICES` variable instead, but I don't have access to any system with multiple AMD GPUs to test that.

This might not be the best way to implement this, but it does raise the question of whether there is a need for a more general way to distribute non-CPU devices (GPUs and TPUs) among sub-jobs.
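For illustration only, here is a minimal sketch of the kind of split I have in mind: each parallel sub-job gets a disjoint slice of the node's GPUs via its own `CUDA_VISIBLE_DEVICES`. This is not what the PR currently does, and the `rlaunch singleshot` command per sub-process and the even one-slice-per-job split are assumptions on my part.

```python
import os
import subprocess

def launch_with_gpus(n_jobs, gpu_ids, cmd=("rlaunch", "singleshot")):
    """Run n_jobs sub-processes, pinning each to a disjoint slice of gpu_ids.

    Sketch only: the launcher command and the even split are assumptions,
    not the behavior implemented in this PR.
    """
    per_job = len(gpu_ids) // n_jobs
    procs = []
    for i in range(n_jobs):
        env = os.environ.copy()
        # Each sub-job only "sees" its own GPUs, e.g. "0" or "2,3".
        my_gpus = gpu_ids[i * per_job:(i + 1) * per_job]
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in my_gpus)
        procs.append(subprocess.Popen(list(cmd), env=env))
    for p in procs:
        p.wait()

# One Perlmutter node, 4 A100s, 4 fireworks needing 1 GPU each:
launch_with_gpus(4, [0, 1, 2, 3])
```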
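If a more general mechanism were wanted, the vendor-specific variable could be hidden behind a single helper. A rough sketch follows; the vendor-to-variable mapping is my assumption, and I have only been able to verify the NVIDIA case myself.

```python
def device_env(vendor, device_ids):
    """Map a vendor name to the environment variable that restricts visible devices.

    Assumed mapping: NVIDIA -> CUDA_VISIBLE_DEVICES, AMD/ROCm -> HIP_VISIBLE_DEVICES.
    Other accelerators (e.g. TPUs) would need their own entries.
    """
    var = {"nvidia": "CUDA_VISIBLE_DEVICES", "amd": "HIP_VISIBLE_DEVICES"}[vendor]
    return {var: ",".join(str(d) for d in device_ids)}

# e.g. merge into a sub-process environment before Popen:
# env.update(device_env("amd", [0]))
```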