[Issue]: Some questions focus on the transports in RCCL, including P2P, SHM, and NET #96

Kyrienn · 2024-12-20T07:04:15Z

Problem Description

1. Does the RX 7900 XT support the P2P transport?

2. If I have a single machine with two AMD GPUs and run rccl-tests/all_reduce_perf, is P2P selected first, with SHM as the fallback? Why is the bandwidth the same whether I set export NCCL_SHM_Disable=1, set export NCCL_P2P_Disable=1, or set neither?

3. What are the differences between the P2P and SHM transports? What additional hardware requirements must be met for P2P in RCCL, such as xGMI?

4. In multi-node multi-GPU setups, regardless of the collective (broadcast, all-reduce, all-to-all, ...), is NET the only transport used between nodes, with P2P or SHM chosen only within a node?

5. Why does the output only show INFO-level logs even when I set export NCCL_DEBUG=TRACE?

Operating System

ubuntu-24.04

CPU

Intel(R) Core(TM) i7-14700K

GPU

2x AMD Radeon RX GPU 7900XT

ROCm Version

ROCm 6.3.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

ppanchad-amd · 2024-12-20T19:17:19Z

Hi @Kyrienn. An internal ticket has been created to assist with your questions. Thanks!

huanrwan-amd · 2024-12-23T21:11:44Z

Hi @Kyrienn, thanks for posting your questions. Before we proceed, can you check your system's ROCm software and hardware info as described in https://rocm.docs.amd.com/projects/rccl/en/develop/how-to/troubleshooting-rccl.html

rocminfo: for name of the GPU or accelerator
rocm-smi --showtopo: display the system topology
dkms status: to check amdgpu kernel driver version

Can you please post the above info first? Thanks.

Kyrienn · 2024-12-24T01:34:18Z

Hi @huanrwan-amd, here is the info:
ROCk module version 6.8.5 is loaded

HSA System Attributes

Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents

Agent 1

Name: Intel(R) Core(TM) i7-14700K
Uuid: CPU-XX
Marketing Name: Intel(R) Core(TM) i7-14700K
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5500
BDFID: 0
Internal Node ID: 0
Compute Unit: 28
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65541592(0x3e815d8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65541592(0x3e815d8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65541592(0x3e815d8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:

Agent 2

Name: gfx1100
Uuid: GPU-ebbfb7875510dab5
Marketing Name: Radeon RX 7900 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 81920(0x14000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2175
BDFID: 768
Internal Node ID: 1
Compute Unit: 84
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 342
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32

Agent 3

Name: gfx1100
Uuid: GPU-0376117fa45511c3
Marketing Name: Radeon RX 7900 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 81920(0x14000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2175
BDFID: 1536
Internal Node ID: 2
Compute Unit: 84
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 342
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

======================================= ROCm System Management Interface =======================================
================================================= Concise Info =================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Avg) (Mem, Compute, ID)

0 1 0x744c, 29122 16.0°C 8.0W N/A, N/A, 0 27Mhz 96Mhz 0% auto 282.0W 5% 0%
1 2 0x744c, 8705 32.0°C 29.0W N/A, N/A, 0 42Mhz 96Mhz 0% auto 282.0W 3% 0%

============================================= End of ROCm SMI Log ==============================================

amdgpu/6.8.5-2070768.24.04, 6.8.0-49-generic, x86_64: installed
amdgpu/6.8.5-2070768.24.04, 6.8.0-51-generic, x86_64: installed

Kyrienn · 2024-12-24T02:16:29Z

============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
GPU0 GPU1
GPU0 0 40
GPU1 40 0

================================= Hops between two GPUs ==================================
GPU0 GPU1
GPU0 0 2
GPU1 2 0

=============================== Link Type between two GPUs ===============================
GPU0 GPU1
GPU0 0 PCIE
GPU1 PCIE 0

======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: -1
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: -1
================================== End of ROCm SMI Log ===================================
The link type shown is PCIE. Why is the weight 40? What is the typical range of weight values for AMD devices?

huanrwan-amd · 2024-12-24T16:48:31Z

Hi @Kyrienn, can you update the amdgpu kernel driver with the amdgpu installer: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html#install-amdgpu-dkms
As of 2024-12-24, the latest version should be amdgpu/6.10.5-2084815.22.04

huanrwan-amd · 2024-12-24T16:50:20Z

Quoting @Kyrienn: "the link type shown is PCIE and why weight is 40? what is the general standard for weight with AMD device?"

A weight of 40 is typical for a PCIe connection. For an xGMI connection, the weight is lower, around 10-20.
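To check the link type and weight programmatically rather than by eye, the matrices in the `rocm-smi --showtopo` report can be parsed with a few lines of Python. This is only a sketch: the helper name `parse_matrix` is hypothetical, the sample text is copied from the report posted earlier in this thread, and the layout (banner line, header row, then one row per GPU) is assumed to match other rocm-smi versions.

```python
import re

def parse_matrix(report: str, section_title: str) -> dict:
    """Return {(row_gpu, col_gpu): value} for one 'between two GPUs' section."""
    lines = report.splitlines()
    # Locate the banner line that carries the section title.
    start = next(i for i, line in enumerate(lines) if section_title in line)
    header = lines[start + 1].split()  # e.g. ['GPU0', 'GPU1']
    matrix = {}
    for line in lines[start + 2:]:
        cells = line.split()
        # Stop at the first line that is not a 'GPUn value value ...' row.
        if not cells or not re.fullmatch(r"GPU\d+", cells[0]):
            break
        for col, value in zip(header, cells[1:]):
            matrix[(cells[0], col)] = value
    return matrix

# Excerpt of the topology report posted above.
SAMPLE = """\
================================ Weight between two GPUs =================================
GPU0 GPU1
GPU0 0 40
GPU1 40 0

=============================== Link Type between two GPUs ===============================
GPU0 GPU1
GPU0 0 PCIE
GPU1 PCIE 0
"""

weights = parse_matrix(SAMPLE, "Weight between two GPUs")
links = parse_matrix(SAMPLE, "Link Type between two GPUs")
print(links[("GPU0", "GPU1")], weights[("GPU0", "GPU1")])  # prints "PCIE 40"
```

On a live system the `report` argument would come from capturing the output of `rocm-smi --showtopo`, for example with `subprocess.run(..., capture_output=True, text=True).stdout`.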

Kyrienn · 2024-12-25T02:21:36Z

Quoting @huanrwan-amd: "can you update the amdgpu kernel driver with amdgpu installer ... the latest should be amdgpu/6.10.5-2084815.22.04"

After running amdgpu-install --usecase=dkms, it is still 6.8.5:

Building dependency tree... Done
Reading state information... Done
amdgpu-dkms is already the newest version (1:6.8.5.60204-2070768.24.04).
linux-headers-6.8.0-49-generic is already the newest version (6.8.0-49.49).
0 upgraded, 0 newly installed, 0 to remove and 127 not upgraded.
(base) kyriechen@kyriechen-System-Product-Name:~/rccl-tests/build$ sudo dkms status
amdgpu/6.8.5-2070768.24.04, 6.8.0-49-generic, x86_64: installed
amdgpu/6.8.5-2070768.24.04, 6.8.0-51-generic, x86_64: installed

Kyrienn · 2024-12-25T09:01:31Z

1. Does the RX 7900 XT support the P2P transport?

2. If I have a single machine with two AMD GPUs and run rccl-tests/all_reduce_perf, is P2P selected first, with SHM as the fallback? Why is the bandwidth the same whether I set export NCCL_SHM_Disable=1, set export NCCL_P2P_Disable=1, or set neither?

3. What are the differences between the P2P and SHM transports? What additional hardware requirements must be met for P2P in RCCL, such as xGMI?

4. In multi-node multi-GPU setups, regardless of the collective (broadcast, all-reduce, all-to-all, ...), is NET the only transport used between nodes, with P2P or SHM chosen only within a node?

5. Why does the output only show INFO-level logs even when I set export NCCL_DEBUG=TRACE?
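One thing worth ruling out for question 2 is the spelling of the variables. NCCL documents them in upper case (`NCCL_P2P_DISABLE`, `NCCL_SHM_DISABLE`), and environment variable names are case-sensitive on Linux, so the mixed-case spellings in the question may simply never be read. The sketch below builds three environments for an A/B comparison. The helper `transport_env` and the configuration names are hypothetical, and it assumes RCCL honors the same variables as NCCL.

```python
import os

def transport_env(disable_p2p=False, disable_shm=False, debug="INFO", base=None):
    """Return a copy of the environment with NCCL transport toggles set."""
    # Start from the caller's environment unless an explicit base is given.
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = debug
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"  # upper case, as documented for NCCL
    if disable_shm:
        env["NCCL_SHM_DISABLE"] = "1"
    return env

# base={} keeps the printed dictionaries small; drop it to inherit PATH etc.
CONFIGS = {
    "default": transport_env(base={}),
    "no_p2p": transport_env(disable_p2p=True, base={}),
    "no_shm": transport_env(disable_shm=True, base={}),
}

for name, env in CONFIGS.items():
    print(name, sorted(env.items()))
```

Each dictionary can be passed as `env=` to `subprocess.run` when launching `all_reduce_perf` from your own build directory. With `NCCL_DEBUG=INFO`, the log from each run can then be compared to see whether the reported transport actually changes between configurations.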

huanrwan-amd · 2024-12-30T16:52:03Z

Hi @Kyrienn, did you remove the older version of the kernel driver? After that, run

sudo apt update
sudo apt upgrade

Then update the driver again.
All installed amdgpu kernel drivers can be found under /var/lib/dkms (the default location).

In general, it is better to update to the latest driver and review the output log.
First, for Q5, the TRACE flag needs to be set to ON when building RCCL: https://github.com/ROCm/rccl/blob/fd03b5b6a572911ce3bc24f94450a2720abdfdcc/CMakeLists.txt#L35
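As a rough illustration of the Q5 advice, the snippet below assembles a CMake configure command with the TRACE option switched on. It is a sketch under assumptions: the helper name and the `rccl` and `rccl/build` directories are hypothetical, the option is assumed to be passed as `-DTRACE=ON` (the standard CMake form for the flag referenced above), and a real RCCL build will likely need further options such as the compiler and install prefix.

```python
import shlex

def rccl_configure_command(source_dir, build_dir, trace=True):
    """Build the argument list for a CMake configure step with TRACE toggled."""
    flag = "ON" if trace else "OFF"
    # -S and -B are the standard CMake source and build directory options.
    return ["cmake", "-S", source_dir, "-B", build_dir, f"-DTRACE={flag}"]

cmd = rccl_configure_command("rccl", "rccl/build")
print(shlex.join(cmd))  # prints "cmake -S rccl -B rccl/build -DTRACE=ON"
```

After rebuilding and pointing the test binary at the rebuilt library, running again with NCCL_DEBUG=TRACE should show whether trace-level messages now appear.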

ppanchad-amd added the Under Investigation label Dec 20, 2024

Kyrienn closed this as completed Dec 25, 2024

Kyrienn reopened this Dec 25, 2024
