Hi,
I am trying to run the example script provided for the llama model, for inference only. Since the repository is going through a migration with a lot of changes, I went back and installed the stable v0.2.0 version. Everything works fine until I try to run the example script with CPU initialization on more than 2 pipeline stages. I am running on a server with 8 NVIDIA L4 GPUs. For pp = 2 it works perfectly, but as soon as I run the same script with pp greater than 2, after the model is initialized all the other GPUs show 0% utilization in the nvidia-smi output, while the GPU ranked 1 sits at 100% utilization, and the entire inference process freezes. Has anyone seen similar issues? Or is there a quick fix I can try?
NVCC and CUDA version: 12.1
torch version: 2.4.0.dev20240521+cu118
I tried downgrading torch to the stable 2.3.0 release and the same problem occurs. The example script I am running is /examples/llama/pippy_llama.py. Since this could be a problem specific to PiPPy v0.2.0, I will try a different PiPPy version later.