[Issue]: Some questions focus on the transports in RCCL, including P2P, SHM, and NET #96

Open
Kyrienn opened this issue Dec 20, 2024 · 9 comments

Comments

@Kyrienn

Kyrienn commented Dec 20, 2024

Problem Description

1. Does the RX 7900 XT support the P2P transport?

2. If I have a single machine with two AMD GPUs and run rccl-tests/all_reduce_perf, is P2P selected first and SHM second? Why is the bandwidth the same whether I set export NCCL_SHM_Disable=1, export NCCL_P2P_Disable=1, or neither of them?

3. What are the differences between the P2P and SHM transports? What additional hardware requirements, such as xGMI, must be met for P2P in RCCL?

4. In multi-node, multi-GPU setups, regardless of whether the collective is broadcast, all-reduce, or all-to-all, is NET the only transport used between nodes, while GPUs within one node just choose P2P or SHM?

5. Why does the output show only INFO-level logs even when I set export NCCL_DEBUG=TRACE?
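
One thing worth ruling out for question 2: environment variable names are case-sensitive on Linux, and NCCL documents these switches in upper case (NCCL_P2P_DISABLE, NCCL_SHM_DISABLE). The sketch below assumes RCCL reads the same upper-case names and that rccl-tests was built into a hypothetical ./build directory. It assembles the three runs to compare and adds NCCL_DEBUG=INFO so the log can be checked for the transport that was picked. Nothing is executed unless the binary exists.

```python
import os
import subprocess

# Assumed location of the rccl-tests binary; adjust to your build directory.
ALL_REDUCE_PERF = "./build/all_reduce_perf"

# Upper-case names as documented for NCCL; assumed to be honored by RCCL too.
CASES = {
    "default": {},
    "no_p2p": {"NCCL_P2P_DISABLE": "1"},
    "no_p2p_no_shm": {"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"},
}


def build_run(case, gpus=2):
    """Return (command, environment) for one all_reduce_perf run."""
    env = dict(os.environ)
    env.update(CASES[case])
    # INFO-level logs normally report the transport chosen for each channel.
    env["NCCL_DEBUG"] = "INFO"
    # -b/-e: min/max message size, -f: size multiplier, -g: GPUs per process
    cmd = [ALL_REDUCE_PERF, "-b", "8", "-e", "128M", "-f", "2", "-g", str(gpus)]
    return cmd, env


if __name__ == "__main__":
    for case in CASES:
        cmd, env = build_run(case)
        print(case, CASES[case], " ".join(cmd))
        if os.path.exists(ALL_REDUCE_PERF):
            subprocess.run(cmd, env=env, check=False)
```

If the three logs still report the same transport and bandwidth with the upper-case names, the spelling was not the cause and the topology output becomes the next thing to look at.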

Operating System

ubuntu-24.04

CPU

Intel(R) Core(TM) i7-14700K

GPU

2x AMD Radeon RX GPU 7900XT

ROCm Version

ROCm 6.3.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @Kyrienn. An internal ticket has been created to assist with your questions. Thanks!

@huanrwan-amd

Hi @Kyrienn, thanks for posting your questions. Before we proceed with them, can you check your system's ROCm software and hardware info as described in https://rocm.docs.amd.com/projects/rccl/en/develop/how-to/troubleshooting-rccl.html

  • rocminfo: shows the name of the GPU or accelerator
  • rocm-smi --showtopo: displays the system topology
  • dkms status: shows the amdgpu kernel driver version

Can you please post the output of the commands above first? Thanks.
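
For anyone who wants to gather all three outputs in one go, here is a minimal sketch. It assumes the tools are on PATH (on some installs rocminfo and rocm-smi live under /opt/rocm/bin) and the helper name `collect` is made up for this example.

```python
import shutil
import subprocess

# The three commands requested above.
COMMANDS = [
    ["rocminfo"],
    ["rocm-smi", "--showtopo"],
    ["dkms", "status"],
]


def collect(commands=COMMANDS, which=shutil.which, run=subprocess.run):
    """Run each command that is on PATH and return {command line: output}."""
    report = {}
    for cmd in commands:
        line = " ".join(cmd)
        if which(cmd[0]) is None:
            # Record the gap instead of failing, so the report stays complete.
            report[line] = "(not found on PATH)"
            continue
        result = run(cmd, capture_output=True, text=True, check=False)
        report[line] = result.stdout + result.stderr
    return report


if __name__ == "__main__":
    for line, output in collect().items():
        print(f"$ {line}\n{output}")
```

The `which` and `run` parameters exist only so the function can be exercised without ROCm installed.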

@Kyrienn
Author

Kyrienn commented Dec 24, 2024

Hi @huanrwan-amd, here is the info:
ROCk module version 6.8.5 is loaded

HSA System Attributes

Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: Intel(R) Core(TM) i7-14700K
Uuid: CPU-XX
Marketing Name: Intel(R) Core(TM) i7-14700K
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5500
BDFID: 0
Internal Node ID: 0
Compute Unit: 28
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65541592(0x3e815d8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65541592(0x3e815d8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65541592(0x3e815d8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1100
Uuid: GPU-ebbfb7875510dab5
Marketing Name: Radeon RX 7900 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 81920(0x14000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2175
BDFID: 768
Internal Node ID: 1
Compute Unit: 84
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 342
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32


Agent 3


Name: gfx1100
Uuid: GPU-0376117fa45511c3
Marketing Name: Radeon RX 7900 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 81920(0x14000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2175
BDFID: 1536
Internal Node ID: 2
Compute Unit: 84
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 342
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

======================================= ROCm System Management Interface =======================================
================================================= Concise Info =================================================
Device  Node  IDs            Temp    Power  Partitions          SCLK   MCLK   Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,   GUID)  (Edge)  (Avg)  (Mem, Compute, ID)

0       1     0x744c, 29122  16.0°C  8.0W   N/A, N/A, 0         27Mhz  96Mhz  0%   auto  282.0W  5%     0%
1       2     0x744c, 8705   32.0°C  29.0W  N/A, N/A, 0         42Mhz  96Mhz  0%   auto  282.0W  3%     0%

============================================= End of ROCm SMI Log ==============================================

amdgpu/6.8.5-2070768.24.04, 6.8.0-49-generic, x86_64: installed
amdgpu/6.8.5-2070768.24.04, 6.8.0-51-generic, x86_64: installed

@Kyrienn
Author

Kyrienn commented Dec 24, 2024

============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0   GPU1
GPU0   0      40
GPU1   40     0

================================= Hops between two GPUs ==================================
       GPU0   GPU1
GPU0   0      2
GPU1   2      0

=============================== Link Type between two GPUs ===============================
       GPU0   GPU1
GPU0   0      PCIE
GPU1   PCIE   0

======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: -1
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: -1
================================== End of ROCm SMI Log ===================================
The link type shown is PCIE, so why is the weight 40? What is the usual range of weights for AMD devices?

@huanrwan-amd

Hi @Kyrienn, can you update the amdgpu kernel driver with the amdgpu installer: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html#install-amdgpu-dkms .
As of today (2024.12.24), the latest version should be amdgpu/6.10.5-2084815.22.04

@huanrwan-amd

huanrwan-amd commented Dec 24, 2024


A weight of 40 is typical for a PCIe connection. For an xGMI connection the weight is lower, around 10-20.
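
To read the weight and link type programmatically, a small parser is enough. The sketch below is fed the matrices posted earlier in this thread as a string. It assumes the whitespace-separated layout shown there, and `parse_matrices` is a name invented for this example, not part of rocm-smi.

```python
# Matrices copied from the `rocm-smi --showtopo` output posted in this thread.
SAMPLE = """\
================================ Weight between two GPUs =================================
GPU0 GPU1
GPU0 0 40
GPU1 40 0

=============================== Link Type between two GPUs ===============================
GPU0 GPU1
GPU0 0 PCIE
GPU1 PCIE 0
"""


def parse_matrices(text):
    """Return {section title: {(row_gpu, col_gpu): cell}} for each matrix."""
    matrices, title, columns = {}, None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("="):
            # Banner line: strip the '=' padding to get the section title.
            title, columns = line.strip("= "), None
            matrices[title] = {}
        elif title and line:
            cells = line.split()
            if columns is None:
                columns = cells  # header row: GPU0 GPU1 ...
            else:
                row, values = cells[0], cells[1:]
                for col, value in zip(columns, values):
                    matrices[title][(row, col)] = value
    return matrices


if __name__ == "__main__":
    m = parse_matrices(SAMPLE)
    weight = int(m["Weight between two GPUs"][("GPU0", "GPU1")])
    link = m["Link Type between two GPUs"][("GPU0", "GPU1")]
    print(link, weight)  # prints: PCIE 40
```

For this system the pair GPU0/GPU1 comes out as PCIE with weight 40, which matches the PCIe figure given above.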

@Kyrienn
Author

Kyrienn commented Dec 25, 2024


After running amdgpu-install --usecase=dkms, the driver is still at 6.8.5:

Building dependency tree... Done
Reading state information... Done
amdgpu-dkms is already the newest version (1:6.8.5.60204-2070768.24.04).
linux-headers-6.8.0-49-generic is already the newest version (6.8.0-49.49).
0 upgraded, 0 newly installed, 0 to remove and 127 not upgraded.
(base) kyriechen@kyriechen-System-Product-Name:~/rccl-tests/build$ sudo dkms status
amdgpu/6.8.5-2070768.24.04, 6.8.0-49-generic, x86_64: installed
amdgpu/6.8.5-2070768.24.04, 6.8.0-51-generic, x86_64: installed
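
A quick way to compare the installed driver with the suggested one is to parse the `dkms status` lines. The sketch below uses the two lines above as input. The regular expression assumes exactly this `module/version, kernel, arch: state` layout, and the helper names are made up for this example. Versions are compared as integer tuples because a plain string comparison would rank "6.8.5" above "6.10.5".

```python
import re

# Output of `dkms status` as posted above.
DKMS_STATUS = """\
amdgpu/6.8.5-2070768.24.04, 6.8.0-49-generic, x86_64: installed
amdgpu/6.8.5-2070768.24.04, 6.8.0-51-generic, x86_64: installed
"""

LINE = re.compile(
    r"^(?P<module>[^/]+)/(?P<version>[^,]+), (?P<kernel>[^,]+), "
    r"(?P<arch>[^:]+): (?P<state>.+)$"
)


def parse_dkms(text):
    """Return one dict per line that matches the expected layout."""
    return [m.groupdict() for m in map(LINE.match, text.splitlines()) if m]


def version_tuple(version):
    """'6.8.5-2070768.24.04' -> (6, 8, 5); the build suffix is ignored."""
    return tuple(int(part) for part in version.split("-")[0].split("."))


if __name__ == "__main__":
    suggested = "6.10.5-2084815.22.04"  # version named earlier in this thread
    for entry in parse_dkms(DKMS_STATUS):
        older = version_tuple(entry["version"]) < version_tuple(suggested)
        print(entry["kernel"], entry["version"], "older than suggested:", older)
```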

@Kyrienn Kyrienn closed this as completed Dec 25, 2024
@Kyrienn Kyrienn reopened this Dec 25, 2024
@Kyrienn
Author

Kyrienn commented Dec 25, 2024

1. Does the RX 7900 XT support the P2P transport?

2. If I have a single machine with two AMD GPUs and run rccl-tests/all_reduce_perf, is P2P selected first and SHM second? Why is the bandwidth the same whether I set export NCCL_SHM_Disable=1, export NCCL_P2P_Disable=1, or neither of them?

3. What are the differences between the P2P and SHM transports? What additional hardware requirements, such as xGMI, must be met for P2P in RCCL?

4. In multi-node, multi-GPU setups, regardless of whether the collective is broadcast, all-reduce, or all-to-all, is NET the only transport used between nodes, while GPUs within one node just choose P2P or SHM?

5. Why does the output show only INFO-level logs even when I set export NCCL_DEBUG=TRACE?

@huanrwan-amd

huanrwan-amd commented Dec 30, 2024

Hi @Kyrienn, did you remove the older version of the kernel driver? After removing it, run

sudo apt update
sudo apt upgrade

and then update the driver again.
All installed amdgpu kernel driver versions can be found under /var/lib/dkms (the default location).

In general, it is better to update to the latest driver and then review the output log.
First, for Q5: TRACE-level output requires building RCCL with the TRACE flag set to ON: https://github.com/ROCm/rccl/blob/fd03b5b6a572911ce3bc24f94450a2720abdfdcc/CMakeLists.txt#L35
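
As a rough sketch of what that rebuild could look like: the snippet below assumes the flag is an ordinary CMake option named TRACE that can be passed as -DTRACE=ON, and that the RCCL sources are checked out in a hypothetical ./rccl directory. It only assembles and prints the commands. Pass `execute=True` to actually run them.

```python
import subprocess


def trace_build_commands(source_dir="rccl", build_dir="rccl/build"):
    """Return the configure and build commands for a TRACE-enabled RCCL build.

    Assumption: the TRACE flag in CMakeLists.txt is a regular CMake option.
    """
    configure = ["cmake", "-S", source_dir, "-B", build_dir, "-DTRACE=ON"]
    build = ["cmake", "--build", build_dir, "--parallel"]
    return [configure, build]


def run_all(commands, execute=False):
    """Print each command; run it only when execute is True."""
    for cmd in commands:
        print(" ".join(cmd))
        if execute:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(trace_build_commands())
```

After such a rebuild, rccl-tests also has to load the rebuilt library rather than the packaged one before NCCL_DEBUG=TRACE can show anything extra.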
