[Issue]: Some questions focus on the transports in RCCL, including P2P, SHM, and NET #96
Comments
Hi @Kyrienn. An internal ticket has been created to assist with your questions. Thanks!
Hi @Kyrienn, thanks for posting your questions. Before we proceed, can you check your system's ROCm software and hardware info by following https://rocm.docs.amd.com/projects/rccl/en/develop/how-to/troubleshooting-rccl.html
Can you please post the above info first? Thanks.
Hi @huanrwan-amd, this is the info:
[ROCm System Management Interface output; only the section headers survive: Hops between two GPUs, Link Type between two GPUs, Numa Nodes]
Hi @Kyrienn, can you update the amdgpu kernel driver using the amdgpu installer: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html#install-amdgpu-dkms
A weight value of 40 is typical for a PCIe connection. For an xGMI connection, the weight is lower, around 10-20.
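The rule of thumb above can be sketched as a small helper. This is only an illustration: `guess_link_type` is a hypothetical name, not part of rocm-smi or RCCL, the cut-offs are taken from the comment above, and treating a weight of 0 as "same device" is an assumption.

```python
def guess_link_type(weight: int) -> str:
    """Heuristically label a GPU-to-GPU link from its topology weight.

    Thresholds follow the rule of thumb quoted in the thread
    (about 40 for PCIe, roughly 10-20 for xGMI); they are not a ROCm API.
    """
    if weight == 0:
        # Assumption: a GPU paired with itself is reported with weight 0.
        return "same device"
    if 10 <= weight <= 20:
        return "likely xGMI"
    if weight >= 40:
        return "likely PCIe"
    return "unknown"
```

Under these assumptions, a weight of 40 between the two GPUs would be labeled "likely PCIe".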
After running amdgpu-install --usecase=dkms, it is still 6.8.5:
Building dependency tree... Done
1. Does the RX 7900XT support the P2P transport?
2. If I have a single machine with two AMD GPUs and run rccl-test/all_reduce_perf, is P2P selected first and then SHM? Why is the bandwidth the same whether I set export NCCL_SHM_Disable=1, export NCCL_P2P_Disable=1, or neither of them?
3. What are the differences between the P2P and SHM transports? What additional hardware requirements, such as xGMI, must be met for P2P in RCCL?
4. In multi-node multi-GPU setups, regardless of whether the collective is broadcast, all-reduce, or all-to-all, is NET the only transport used across nodes, with P2P or SHM chosen only within a single node?
5. Why does the output only show INFO-level logs even when I set export NCCL_DEBUG=TRACE?
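One way to check question 2 is to run the same test under each setting and compare what the debug log reports. The sketch below assumes a few things: the binary path `./build/all_reduce_perf` is hypothetical, `build_env` and `transport_lines` are illustrative helpers rather than RCCL APIs, and the log filter relies on transport lines typically containing " via ". It uses the upper-case variable names documented for NCCL (`NCCL_P2P_DISABLE`, `NCCL_SHM_DISABLE`); on Linux, environment variable names are case-sensitive.

```python
import os
import subprocess

# Hypothetical location of the rccl-tests binary; adjust to your build.
ALL_REDUCE_PERF = "./build/all_reduce_perf"

# Settings to compare; names are the upper-case spellings from the NCCL docs.
CONFIGS = {
    "default": {},
    "p2p_disabled": {"NCCL_P2P_DISABLE": "1"},
    "p2p_and_shm_disabled": {"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"},
}


def build_env(overrides, base=None):
    """Return a copy of the environment with debug logging and overrides set."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"        # print initialization details
    env["NCCL_DEBUG_SUBSYS"] = "INIT,ENV"  # also log which variables were read
    env.update(overrides)
    return env


def transport_lines(log_text):
    """Keep log lines that appear to describe a channel's transport."""
    return [line for line in log_text.splitlines() if " via " in line]


def run_config(name):
    """Run all_reduce_perf on 2 GPUs under one config and return transport lines."""
    cmd = [ALL_REDUCE_PERF, "-b", "8", "-e", "128M", "-f", "2", "-g", "2"]
    result = subprocess.run(
        cmd, env=build_env(CONFIGS[name]), capture_output=True, text=True
    )
    return transport_lines(result.stdout)
```

Calling `run_config(name)` for each key in `CONFIGS` and comparing the returned lines would show whether the selected transport actually changes between runs. If it does not, the variables may not have been picked up.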
Hi @Kyrienn, did you remove the older version of the kernel driver? After that, apply
and update the driver again. In general, it is better to update to the latest driver and review the output log.
Problem Description
2. If I have a single machine with two AMD GPUs and run rccl-test/all_reduce_perf, is P2P selected first and then SHM? Why is the bandwidth the same whether I set export NCCL_SHM_Disable=1, export NCCL_P2P_Disable=1, or neither of them?
3. What are the differences between the P2P and SHM transports? What additional hardware requirements, such as xGMI, must be met for P2P in RCCL?
4. In multi-node multi-GPU setups, regardless of whether the collective is broadcast, all-reduce, or all-to-all, is NET the only transport used across nodes, with P2P or SHM chosen only within a single node?
Operating System
ubuntu-24.04
CPU
Intel(R) Core(TM) i7-14700K
GPU
2x AMD Radeon RX GPU 7900XT
ROCm Version
ROCm 6.3.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response