This repository has been archived by the owner on Jan 26, 2024. It is now read-only.

SWDEV-362046 - Report HIP_OPS activities using the ROCr driver_node_id instead of the device's index

The ROCclr assigns zero-based IDs to GPUs in the order they are
discovered. That zero-based ID is what is used to identify the GPU
on which the HIP_OPS activity took place.

When multiple ranks are used, each rank's first logical device always
has GPU ID 0, regardless of which physical device is selected with
CUDA_VISIBLE_DEVICES. Because of this, when merging trace files from
multiple ranks, GPU IDs from different processes may overlap.

The long-term solution is to use the KFD's gpu_id, which is stable
across APIs and processes. Unfortunately, the gpu_id is not yet exposed
by the ROCr, so for now use the driver's node id.

Change-Id: Ib78854527d600d175bb76e2df0747c33f898c615
(cherry picked from commit 7de8e6b)
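
To make the overlap described in the commit message concrete, here is a minimal, self-contained sketch. It is not ROCclr code: `Record`, `VisibleGpus`, `recordsByIndex`, `recordsByNodeId`, `mergeTraces`, and `countSharedIds` are hypothetical names, and the node id values used with them are made up for illustration. The sketch assumes each rank sees a different physical GPU and compares labelling records by zero-based index against labelling them by driver node id.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// One activity record as a trace consumer might see it (hypothetical layout).
struct Record {
  int rank;             // process that emitted the record
  std::uint32_t gpuId;  // device id field written into the trace
};

// Hypothetical view of one rank: driver node ids of the GPUs it can see, in
// discovery order. The position in the vector stands in for the zero-based
// index assigned at discovery time.
using VisibleGpus = std::vector<std::uint32_t>;

// Label one record per visible GPU with its zero-based index.
std::vector<Record> recordsByIndex(int rank, const VisibleGpus& gpus) {
  std::vector<Record> out;
  for (std::uint32_t i = 0; i < gpus.size(); ++i) {
    out.push_back({rank, i});
  }
  return out;
}

// Label one record per visible GPU with its driver node id.
std::vector<Record> recordsByNodeId(int rank, const VisibleGpus& gpus) {
  std::vector<Record> out;
  for (std::uint32_t nodeId : gpus) {
    out.push_back({rank, nodeId});
  }
  return out;
}

// Concatenate the traces of two ranks.
std::vector<Record> mergeTraces(std::vector<Record> a, const std::vector<Record>& b) {
  a.insert(a.end(), b.begin(), b.end());
  return a;
}

// Count device ids that appear under more than one rank in a merged trace.
std::size_t countSharedIds(const std::vector<Record>& merged) {
  std::map<std::uint32_t, std::set<int>> ranksPerId;
  for (const Record& r : merged) {
    ranksPerId[r.gpuId].insert(r.rank);
  }
  std::size_t shared = 0;
  for (const auto& entry : ranksPerId) {
    if (entry.second.size() > 1) {
      ++shared;
    }
  }
  return shared;
}
```

With rank 0 seeing only a GPU at (assumed) node id 2 and rank 1 seeing only a GPU at node id 3, merging the index-labelled traces gives one id (0) shared by both ranks, while merging the node-id-labelled traces gives none. If two ranks really do select the same physical GPU, the node ids coincide, which is the intended result.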
lmoriche authored and zhang2amd committed Nov 8, 2022
1 parent 1e7d894 commit c924ab1
Showing 3 changed files with 11 additions and 2 deletions.
2 changes: 2 additions & 0 deletions device/device.hpp
@@ -613,6 +613,8 @@ struct Info : public amd::EmbeddedObject {

   bool virtualMemoryManagement_; //!< Virtual memory management support
   size_t virtualMemAllocGranularity_; //!< virtual memory allocation size/addr granularity
+
+  uint32_t driverNodeId_;
 };

 //! Device settings
7 changes: 7 additions & 0 deletions device/rocm/rocdevice.cpp
@@ -1179,6 +1179,13 @@ bool Device::populateOCLDeviceConstants() {
   }
   assert(info_.globalMemChannels_ > 0);

+  if (HSA_STATUS_SUCCESS !=
+      hsa_agent_get_info(bkendDevice_,
+                         static_cast<hsa_agent_info_t>(HSA_AMD_AGENT_INFO_DRIVER_NODE_ID),
+                         &info_.driverNodeId_)) {
+    return false;
+  }
+
   setupCpuAgent();

   checkAtomicSupport();
4 changes: 2 additions & 2 deletions platform/activity.cpp
@@ -73,8 +73,8 @@ void ReportActivity(const amd::Command& command) {
       command.profilingInfo().start_,  // begin timestamp, ns
       command.profilingInfo().end_,    // end timestamp, ns
       {{
-          static_cast<int>(queue->device().index()),  // device id
-          queue->vdev()->index()                      // queue id
+          static_cast<int>(queue->device().info().driverNodeId_),  // device id
+          queue->vdev()->index()                                   // queue id
       }},
       {}  // copied data size for memcpy, or kernel name for dispatch
   };