Peer-to-peer (P2P) communication allows direct communication over PCIe (or NVLINK, in some NVIDIA configurations) between devices. Given that it is not necessary to access the host in order to transfer data between devices, P2P communications provide a lower latency than traditional communications that do need to access the host.
Because P2P communication is done over PCIe/NVLINK, the availability of this type of communication among devices depends mostly on the PCIe/NVLINK topology existing.
In this example, the result of a matrix transpose kernel execution on one device is directly copied to the other one, showcasing how to carry out a P2P communication between two GPUs.
- P2P communication support is checked among the available devices. In case two of these devices are found to have it between them, they are selected for the example. A trace message informs about the IDs of the devices selected.
- The input and output matrices are allocated and initialized in host memory.
- The first device selected is set as the current device, device memory for the input and output matrices is allocated on the current device and the input data is copied from the host.
- A matrix transpose kernel using static shared memory is then launched on the current device using the previously defined arguments. A synchronization function is used to wait for the kernel to finish before continuing the host execution.
- The second device selected is set as the current device and the necessary amount of device memory for the input and output matrices for the second kernel execution is allocated on the current device.
- Direct memory access is enabled from the second device to the first one. This allows memory to be copied from the first device to the second with the usual memory-copy functions.
- The input data is copied to the current device from the output matrix allocated on the first device and a matrix transpose kernel using dynamic shared memory is then launched on the current device. A synchronization function is used to wait for the kernel to finish before continuing the host execution.
- The output matrix from this second kernel execution is copied back to host memory.
- Direct memory access from the second device to the first one is disabled.
- Device memory is freed.
- Results are validated and printed to the standard output.
hipGetDeviceCount
gives the number of devices available. In this example it allows to check if there is more than one device available.hipDeviceCanAccessPeer
queries whether a certain device can directly access the memory of a given peer device. A P2P communication is supported between two devices if this function returns true for those two devices.hipSetDevice
sets the specified device as the default device for the subsequent API calls. Such a device is then known as current device.- Once a current device is selected, and if a P2P communication is possible with a certain peer device,
hipDeviceEnablePeerAccess
can be used to enable access from the current device to the peer device's memory. WithhipDeviceDisablePeerAccess
it can also be disabled (provided thathipDeviceEnablePeerAccess
has already been called for the same current and peer devices). hipMalloc
allocates memory in the global memory of the device. WhenhipSetDevice
is called to set a specific current device, the subsequent calls tohipMalloc
will allocate memory in the current device's global memory. In the example it is showcased how to use these two functions to allocate memory on the two devices used.- With
hipMemcpy
data bytes can be transferred from host to device (usinghipMemcpyHostToDevice
), from device to host (usinghipMemcpyDeviceToHost
) or from device to device (usinghipMemcpyDeviceToDevice
). The latter will only work if P2P communication has been enabled from the destination to the source device. myKernelName<<<...>>>
queues the execution of a kernel in the current device andhipDeviceSynchronize
makes the host to wait on all active streams on the current device. In this examplehipDeviceSynchronize
is necessary because the second device needs the results obtained from the previous kernel execution on the first device.hipDeviceReset
discards the state of the current device and updates it to fresh one. It also frees all the resources (e.g. streams, events, ...) associated with the current device.
__global__
__shared__
blockDim
blockIdx
threadIdx
__syncthreads
hipDeviceCanAccessPeer
hipDeviceDisablePeerAccess
hipDeviceEnablePeerAccess
hipDeviceReset
hipDeviceSynchronize
hipFree
hipGetDeviceCount
hipMalloc
hipMemcpy
hipMemcpyDeviceToDevice
hipMemcpyDeviceToHost
hipMemcpyHostToDevice
hipSetDevice
hipStreamDefault
hipSuccess