Skip to content

Commit

Permalink
SDK doc updates (#1183)
Browse files Browse the repository at this point in the history
* correcting usage example

* rccl trace

* Adding Navi power state limitation

* Addressed feedback

* kernel-rename

* kokkos trace

* more information on kookos tracing

* Corecting tool library hardcoding

* summary domains

* Updating domain stats file

* updating images

* rocprofv3 default behavior update

* Removing README from API documentation

* Added missing description in Topics

* Fixed wrong rendering of README in API document

* Fixing Topics in API docs

* Removing API doc for details/rccl.h

* Addressed review comments
  • Loading branch information
bgopesh authored Nov 22, 2024
1 parent 8d2ce4b commit 7ea9ced
Show file tree
Hide file tree
Showing 13 changed files with 234 additions and 43 deletions.
19 changes: 15 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -73,13 +73,24 @@ Please report in the Github Issues.
- **Need for Cold Restart**: In the event of a hardware freeze, you may need to perform a cold restart (turning the hardware off and on) to restore normal operations.
Please use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk.

- At this point, We do not recommend stress-testing the beta implementation.
- At this point, We do not recommend stress-testing the beta implementation.

- Correlation IDs provided by the PC sampling service are verified only for HIP API calls.
- Correlation IDs provided by the PC sampling service are verified only for HIP API calls.

- Timestamps in PC sampling records might not be 100% accurate.
- Timestamps in PC sampling records might not be 100% accurate.

- Using PC sampling on multi-threaded applications might fail with `HSA_STATUS_ERROR_EXCEPTION`.Furthermore, if three or more threads launch operations to the same agent, and if PC sampling is enabled, the `HSA_STATUS_ERROR_EXCEPTION` might appear.
- Using PC sampling on multi-threaded applications might fail with `HSA_STATUS_ERROR_EXCEPTION`.Furthermore, if three or more threads launch operations to the same agent, and if PC sampling is enabled, the `HSA_STATUS_ERROR_EXCEPTION` might appear.

- Navi3x requires a stable power state for counter collection.
Currently, this state needs to be set by the user.
To do so, set "power_dpm_force_performance_level" to be writeable for non-root users, then set performance level to profile_standard:

```bash
sudo chmod 777 /sys/class/drm/card0/device/power_dpm_force_performance_level
echo profile_standard >> /sys/class/drm/card0/device/power_dpm_force_performance_level
```

Recommended: "profile_standard" for counter collection and "auto" for all other profiling. Use rocm-smi to verify the current power state. For multiGPU systems (includes integrated graphics), replace "card0" by the desired card.

> [!WARNING]
> The latest mainline version of AQLprofile can be found at [https://repo.radeon.com/rocm/misc/aqlprofile/](https://repo.radeon.com/rocm/misc/aqlprofile/). However, it's important to note that updates to the public AQLProfile may not occur as frequently as updates to the rocprofiler-sdk. This discrepancy could lead to a potential mismatch between the AQLprofile binary and the rocprofiler-sdk source.
2 changes: 1 addition & 1 deletion source/docs/api-reference/tool_library.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ myst:

# ROCprofiler-SDK tool library

The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.0.4.0`, which uses these libraries.
The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.X.Y.Z`, which uses these libraries.

## ROCm runtimes design

Expand Down
8 changes: 7 additions & 1 deletion source/docs/conceptual/comparing-with-legacy-tools.rst
Original file line number Diff line number Diff line change
Expand Up @@ -383,4 +383,10 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
Timing Difference Between rocprofv3 and rocprofv1/v2
========================================================

Rocprofv3 has improved the accuracy of timing information by reducing the tool overhead required to collect data and reducing the interference to the timing of the kernel being measured. The result of this work is a reduction in variance of kernel times received for the same kernel execution and more accurate timing in general. These changes have not been backported (and will not be backported) to rocprofv1/v2, so there can be substantial (20%) differences in execution time reported by v1/v2 vs v3 for a single kernel execution. Over a large number of samples of the same kernel, the difference in average execution time is in the low single digit percentage time with a much tighter variance of results on rocprofv3. We have included testing in the test suite to verify the timing information outputted by rocprofv3 to ensure that the values we are returning are accurate.
``rocprofv3`` has improved the accuracy of timing information by reducing the tool overhead required to collect data and reducing the interference to the timing of the kernel being measured. The result of this work is a reduction in variance of kernel times received for the same kernel execution and more accurate timing in general. These changes have not been backported (and will not be backported) to rocprofv1/v2, so there can be substantial (20%) differences in execution time reported by v1/v2 vs v3 for a single kernel execution. Over a large number of samples of the same kernel, the difference in average execution time is in the low single digit percentage time with a much tighter variance of results on rocprofv3. We have included testing in the test suite to verify the timing information outputted by rocprofv3 to ensure that the values we are returning are accurate.

========================================================
Default run of rocprofv3 and rocprofv1/v2
========================================================

``rocprofv3`` has a different default behavior than rocprofv1/v2 when being run without any option. The default behavior of rocprofv3 is to collect all available agents on the system and to output it in ``csv`` format. The default behavior of rocprofv1/v2 was to output the `kernel traces` in CSV format. In rocprofv3, kernel traces can be obtained by using ``--kernel-trace`` option.
2 changes: 2 additions & 0 deletions source/docs/data/hip_domain_stats.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"HIP_API",13,458514859,35270373.769231,100.00,2300,352276613,99315857.546240
22 changes: 22 additions & 0 deletions source/docs/data/rccl_trace.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"RCCL_API","ncclGetVersion",1834151,1834151,416,18413845573432,18413845577374
"RCCL_API","ncclGetUniqueId",1834151,1834151,1116,18413961300878,18413963267869
"RCCL_API","ncclGetUniqueId",1834151,1834151,1481,18414166449182,18414166720831
"RCCL_API","ncclGroupStart",1834151,1834151,1482,18414166723772,18414166726834
"RCCL_API","ncclGroupEnd",1834151,1834151,1490,18414166823575,18414380520973
"RCCL_API","ncclCommInitAll",1834151,1834151,1477,18414166402665,18414380522536
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89098,18414380660695,18414380661652
"RCCL_API","ncclAllReduce",1834151,1834151,89097,18414380653860,18414380693574
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89108,18414380694631,18414380694659
"RCCL_API","ncclAllReduce",1834151,1834151,89107,18414380694212,18414380704722
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89117,18414380706650,18414380706677
"RCCL_API","ncclAllReduce",1834151,1834151,89116,18414380705574,18414380715055
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89126,18414380715749,18414380715774
"RCCL_API","ncclAllReduce",1834151,1834151,89125,18414380715463,18414380723944
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89135,18414380724688,18414380724715
"RCCL_API","ncclAllReduce",1834151,1834151,89134,18414380724395,18414380732209
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89154,18414380746383,18414380746411
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89157,18414380749863,18414380749889
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89160,18414380751671,18414380751696
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89163,18414380753326,18414380753353
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89166,18414380755128,18414380755154
Binary file added source/docs/data/rocprofv3_hip_memcpy_summary.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added source/docs/data/rocprofv3_memcpy_summary.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added source/docs/data/rocprofv3_summary.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
184 changes: 165 additions & 19 deletions source/docs/how-to/using-rocprofv3.rst
Original file line number Diff line number Diff line change
Expand Up @@ -173,7 +173,7 @@ To use ``rocprofv3`` for application tracing, run:

.. code-block:: bash
rocprofv3 <tracing_option> -- <app_relative_path>
rocprofv3 <tracing_option> -- <application_path>
HIP trace
+++++++++++
Expand All @@ -184,7 +184,7 @@ To trace HIP runtime APIs, use:

.. code-block:: bash
rocprofv3 --hip-trace -- < app_relative_path >
rocprofv3 --hip-trace -- <application_path>
The above command generates a ``hip_api_trace.csv`` file prefixed with the process ID.

Expand All @@ -203,7 +203,7 @@ To trace HIP compile time APIs, use:

.. code-block:: shell
rocprofv3 --hip-compiler-trace -- < app_relative_path >
rocprofv3 --hip-compiler-trace -- <application_path>
The above command generates a ``hip_api_trace.csv`` file prefixed with the process ID.

Expand All @@ -229,7 +229,7 @@ HSA trace contains the start and end time of HSA runtime API calls and their asy

.. code-block:: bash
rocprofv3 --hsa-trace -- < app_relative_path >
rocprofv3 --hsa-trace -- <application_path>
The above command generates a ``hsa_api_trace.csv`` file prefixed with process ID. Note that the contents of this file have been truncated for demonstration purposes.

Expand Down Expand Up @@ -295,7 +295,7 @@ To trace the API calls enclosed within the range, use:

.. code-block:: bash
rocprofv3 --marker-trace -- < app_relative_path >
rocprofv3 --marker-trace -- <application_path>
Running the preceding command generates a ``marker_api_trace.csv`` file prefixed with the process ID.

Expand All @@ -312,14 +312,82 @@ Here are the contents of ``marker_api_trace.csv`` file:

For the description of the fields in the output file, see :ref:`output-file-fields`.

Kernel Rename
++++++++++++++

To rename kernels with their enclosing roctxRangePush/roctxRangePop message. Known as --roctx-rename in earlier rocprof versions.

See how to use ``--kernel-rename`` option with help of below code snippet:

.. code-block:: bash
#include <rocprofiler-sdk-roctx/roctx.h>
roctxRangePush("HIP_Kernel-1");
// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH);
// Memory transfer from device to host
roctxRangePush("hipMemCpy-DeviceToHost");
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
roctxRangePop(); // for "hipMemcpy"
roctxRangePop(); // for "hipLaunchKernel"
roctxRangeStop(rangeId);
To rename the kernel , use:

.. code-block:: bash
rocprofv3 --marker-trace --kernel-rename -- <application_path>
The above command generates a ``marker-trace`` file prefixed with the process ID.

.. code-block:: shell
$ cat 210_marker_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247
"MARKER_CONTROL_API","roctxProfilerPause",315155,315155,3,58378844627184,58378844627502
"MARKER_CONTROL_API","roctxProfilerResume",315155,315155,4,58378844638601,58378844639267
"MARKER_CORE_API","pre-kernel-launch",315155,315155,5,58378844641787,58378844641787
"MARKER_CORE_API","post-kernel-launch",315155,315155,6,58378844936586,58378844936586
"MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270
"MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485
Kokkos Trace
++++++++++++++

rocprofv3 has a built-in `Kokkos Tools library <https://github.com/kokkos/kokkos-tools>`_ support to trace Kokkos API calls. `Kokkos <https://github.com/kokkos/kokkos>`_ is a C++ library for writing performance portable applications. It is used in many scientific applications to write performance portable code that can run on CPUs, GPUs, and other accelerators.
rocprofv3 loads a built-in Kokkos tools library which emits roctx ranges with the labels passed through the API, e.g. Kokkos::parallel_for(“MyParallelForLabel”, …); will internally calls for roctxRangePush and enables the kernel renaming option so that the highly templated kernel names are replaced by the Kokkos labels.
To enable built-in marker support, use the ``kokkos-trace`` option. Internally this option enables ``marker-trace`` and ``kernel-rename``.:

.. code-block:: bash
rocprofv3 --kokkos-trace -- <application_path>
The above command generates a ``marker-trace`` file prefixed with the process ID.

.. code-block:: shell
$ cat 210_marker_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","Kokkos::Initialization Complete",4069256,4069256,1,56728499773965,56728499773965
"MARKER_CORE_API","Kokkos::Impl::CombinedFunctorReducer<CountFunctor, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Serial>, CountFunctor, long int>::Reducer, void>",4069256,4069256,2,56728501756088,56728501764241
"MARKER_CORE_API","Kokkos::parallel_reduce: fence due to result being value, not view",4069256,4069256,4,56728501767957,56728501769600
"MARKER_CORE_API","Kokkos::Finalization Complete",4069256,4069256,6,56728502054554,56728502054554
Kernel trace
++++++++++++++

To trace kernel dispatch traces, use:

.. code-block:: shell
rocprofv3 --kernel-trace -- < app_relative_path >
rocprofv3 --kernel-trace -- <application_path>
The above command generates a ``kernel_trace.csv`` file prefixed with the process ID.

Expand All @@ -343,7 +411,7 @@ To trace memory moves across the application, use:

.. code-block:: shell
rocprofv3 –-memory-copy-trace -- < app_relative_path >
rocprofv3 –-memory-copy-trace -- <application_path>
The above command generates a ``memory_copy_trace.csv`` file prefixed with the process ID.

Expand Down Expand Up @@ -400,7 +468,7 @@ memory operations (copies and scratch).

.. code-block:: shell
rocprofv3 –-runtime-trace -- < app_relative_path >
rocprofv3 –-runtime-trace -- <application_path>
Running the above command generates ``hip_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``scratch_memory_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` (if ``ROCTx`` APIs are specified in the application) files prefixed with the process ID.

Expand All @@ -411,7 +479,7 @@ This is an all-inclusive option to collect all the above-mentioned traces.

.. code-block:: shell
rocprofv3 –-sys-trace -- < app_relative_path >
rocprofv3 –-sys-trace -- <application_path>
Running the above command generates ``hip_api_trace.csv``, ``hsa_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` (if ``ROCTx`` APIs are specified in the application) files prefixed with the process ID.

Expand All @@ -422,19 +490,45 @@ This option collects scratch memory operation's traces. Scratch is an address sp

.. code-block:: shell
rocprofv3 --scratch-memory-trace -- < app_relative_path >
rocprofv3 --scratch-memory-trace -- <application_path>
Stats
++++++++
RCCL trace
++++++++++++

`RCCL <https://github.com/ROCm/rccl>`_ (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs. This option traces those communication routines.

.. code-block:: shell
rocprofv3 --rccl-trace -- <application_path>
The above command generates a ``rccl_api_trace`` file prefixed with the process ID.

.. code-block:: shell
$ cat 197_rccl_api_trace.csv
Here are the contents of ``rccl_api_trace.csv`` file:

.. csv-table:: RCCL trace
:file: /data/rccl_trace.csv
:widths: 10,10,10,10,10,20,20
:header-rows: 1

Post-processing tracing options
++++++++++++++++++++++++++++++++

1. Stats
+++++++++

This option collects statistics for the enabled tracing types. For example, to collect statistics of HIP APIs, when HIP trace is enabled.
A higher percentage in statistics can help user focus on the API/function that has taken the most time:

.. code-block:: shell
rocprofv3 --stats --hip-trace -- < app_relative_path >
rocprofv3 --stats --hip-trace -- <application_path>
The above command generates a ``hip_api_stats.csv`` and ``hip_api_trace`` file prefixed with the process ID.
The above command generates a ``hip_api_stats.csv``, ``domain_stats.csv`` and ``hip_api_trace.csv`` file prefixed with the process ID.

.. code-block:: shell
Expand All @@ -447,8 +541,60 @@ Here are the contents of ``hip_api_stats.csv`` file:
:widths: 10,10,20,20,10,10,10,10
:header-rows: 1

Here are the contents of ``domain_stats.csv`` file:

.. csv-table:: Domain stats
:file: /data/hip_domain_stats.csv
:widths: 10,10,20,20,10,10,10,10
:header-rows: 1

For the description of the fields in the output file, see :ref:`output-file-fields`.

2. Summary
+++++++++++

Output single summary of tracing data at the conclusion of the profiling session

.. code-block:: shell
rocprofv3 -S --hip-trace -- <application_path>
.. image:: /data/rocprofv3_summary.png


2.1 Summary per domain
++++++++++++++++++++++

Outputs the summary of each tracing domain at the end of profiling session.

.. code-block:: shell
rocprofv3 -D --hsa-trace --hip-trace -- <application_path>
The above command generates a ``hip_trace.csv``, ``hsa_trace.csv`` file prefixed with the process ID along with the summary of each domain at the terminal.

2.2 Summary groups
+++++++++++++++++++

Users can create a summary of multiple domains by specifying the domain names in the command line. The summary groups are separated by a pipe (|) symbol.
To create a summary for ``MEMORY_COPY`` domains, use:
.. code-block:: shell
rocprofv3 --summary-groups MEMORY_COPY --sys-trace -- <application_path>
.. image:: /data/rocprofv3_memcpy_summary.png


To create a summary for ``MEMORY_COPY`` and ``HIP_API`` domains, use:

.. code-block:: shell
rocprofv3 --summary-groups 'MEMORY_COPY|HIP_API' --sys-trace -- <application_path>
.. image:: /data/rocprofv3_hip_memcpy_summary.png


Kernel profiling
-------------------

Expand Down Expand Up @@ -540,7 +686,7 @@ Properties
{
"jobs": [
{
"pmc": ["SQ_WAVES", "GRBM_COUNT", "GUI_ACTIVE"]
"pmc": ["SQ_WAVES", "GRBM_COUNT", "GRBM_GUI_ACTIVE"]
},
{
"pmc": ["FETCH_SIZE", "WRITE_SIZE"],
Expand All @@ -564,7 +710,7 @@ Properties
- pmc:
- SQ_WAVES
- GRBM_COUNT
- GUI_ACTIVE
- GRBM_GUI_ACTIVE
- 'TCC_HIT[1]'
- 'TCC_HIT[2]'
- pmc:
Expand All @@ -581,7 +727,7 @@ To supply the counters via ``command-line`` options, use:
.. code-block:: shell
rocprofv3 --pmc SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE -- <app_relative_path>
rocprofv3 --pmc SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE -- <application_path>
.. note::
1. Please note that more than 1 counters should be separated by a space or a comma.
Expand All @@ -594,7 +740,7 @@ To supply the input file for kernel profiling, use:
.. code-block:: shell
rocprofv3 -i input.txt -- <app_relative_path>
rocprofv3 -i input.txt -- <application_path>
Running the above command generates a ``./pmc_n/counter_collection.csv`` file prefixed with the process ID. For each ``pmc`` row, a directory ``pmc_n`` containing a ``counter_collection.csv`` file is generated, where n = 1 for the first row and so on.
Expand Down Expand Up @@ -666,7 +812,7 @@ To collect counters for the kernels matching the filters specified in the preced
.. code-block:: shell
rocprofv3 -i input.yml -- <app_relative_path>
rocprofv3 -i input.yml -- <application_path>
$ cat pass_1/312_counter_collection.csv
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
Expand Down
Loading

0 comments on commit 7ea9ced

Please sign in to comment.