diff --git a/README.md b/README.md index 8d2a5006..40ddcfa0 100644 --- a/README.md +++ b/README.md @@ -73,13 +73,24 @@ Please report in the Github Issues. - **Need for Cold Restart**: In the event of a hardware freeze, you may need to perform a cold restart (turning the hardware off and on) to restore normal operations. Please use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk. -- At this point, We do not recommend stress-testing the beta implementation. + - At this point, we do not recommend stress-testing the beta implementation. -- Correlation IDs provided by the PC sampling service are verified only for HIP API calls. + - Correlation IDs provided by the PC sampling service are verified only for HIP API calls. -- Timestamps in PC sampling records might not be 100% accurate. + - Timestamps in PC sampling records might not be 100% accurate. -- Using PC sampling on multi-threaded applications might fail with `HSA_STATUS_ERROR_EXCEPTION`.Furthermore, if three or more threads launch operations to the same agent, and if PC sampling is enabled, the `HSA_STATUS_ERROR_EXCEPTION` might appear. + - Using PC sampling on multi-threaded applications might fail with `HSA_STATUS_ERROR_EXCEPTION`. Furthermore, if three or more threads launch operations to the same agent, and if PC sampling is enabled, the `HSA_STATUS_ERROR_EXCEPTION` might appear. + +- Navi3x requires a stable power state for counter collection. + Currently, this state needs to be set by the user. + To do so, make "power_dpm_force_performance_level" writable for non-root users, then set the performance level to profile_standard: + + ```bash + sudo chmod 777 /sys/class/drm/card0/device/power_dpm_force_performance_level + echo profile_standard >> /sys/class/drm/card0/device/power_dpm_force_performance_level + ``` + + Recommended: "profile_standard" for counter collection and "auto" for all other profiling. 
Use rocm-smi to verify the current power state. For multi-GPU systems (including systems with integrated graphics), replace "card0" with the desired card. > [!WARNING] > The latest mainline version of AQLprofile can be found at [https://repo.radeon.com/rocm/misc/aqlprofile/](https://repo.radeon.com/rocm/misc/aqlprofile/). However, it's important to note that updates to the public AQLProfile may not occur as frequently as updates to the rocprofiler-sdk. This discrepancy could lead to a potential mismatch between the AQLprofile binary and the rocprofiler-sdk source. diff --git a/source/docs/api-reference/tool_library.md b/source/docs/api-reference/tool_library.md index cce13d50..9ce1660e 100644 --- a/source/docs/api-reference/tool_library.md +++ b/source/docs/api-reference/tool_library.md @@ -7,7 +7,7 @@ myst: # ROCprofiler-SDK tool library -The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.0.4.0`, which uses these libraries. +The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.X.Y.Z`, which uses these libraries. 
## ROCm runtimes design diff --git a/source/docs/conceptual/comparing-with-legacy-tools.rst b/source/docs/conceptual/comparing-with-legacy-tools.rst index 619a244c..16651196 100644 --- a/source/docs/conceptual/comparing-with-legacy-tools.rst +++ b/source/docs/conceptual/comparing-with-legacy-tools.rst @@ -383,4 +383,10 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more Timing Difference Between rocprofv3 and rocprofv1/v2 ======================================================== -Rocprofv3 has improved the accuracy of timing information by reducing the tool overhead required to collect data and reducing the interference to the timing of the kernel being measured. The result of this work is a reduction in variance of kernel times received for the same kernel execution and more accurate timing in general. These changes have not been backported (and will not be backported) to rocprofv1/v2, so there can be substantial (20%) differences in execution time reported by v1/v2 vs v3 for a single kernel execution. Over a large number of samples of the same kernel, the difference in average execution time is in the low single digit percentage time with a much tighter variance of results on rocprofv3. We have included testing in the test suite to verify the timing information outputted by rocprofv3 to ensure that the values we are returning are accurate. +``rocprofv3`` has improved the accuracy of timing information by reducing the tool overhead required to collect data and reducing the interference to the timing of the kernel being measured. The result of this work is a reduction in variance of kernel times received for the same kernel execution and more accurate timing in general. These changes have not been backported (and will not be backported) to rocprofv1/v2, so there can be substantial (20%) differences in execution time reported by v1/v2 vs v3 for a single kernel execution. 
Over a large number of samples of the same kernel, the difference in average execution time is in the low single-digit percentage range, with a much tighter variance of results on rocprofv3. We have included testing in the test suite to verify that the timing information output by rocprofv3 is accurate. + +======================================================== +Default run of rocprofv3 and rocprofv1/v2 +======================================================== + +``rocprofv3`` has a different default behavior than rocprofv1/v2 when run without any options. By default, rocprofv3 collects information about all available agents on the system and outputs it in ``csv`` format. The default behavior of rocprofv1/v2 was to output the kernel traces in CSV format. In rocprofv3, kernel traces can be obtained by using the ``--kernel-trace`` option. diff --git a/source/docs/data/hip_domain_stats.csv b/source/docs/data/hip_domain_stats.csv new file mode 100644 index 00000000..55cd0fc2 --- /dev/null +++ b/source/docs/data/hip_domain_stats.csv @@ -0,0 +1,2 @@ +"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev" +"HIP_API",13,458514859,35270373.769231,100.00,2300,352276613,99315857.546240 \ No newline at end of file diff --git a/source/docs/data/rccl_trace.csv b/source/docs/data/rccl_trace.csv new file mode 100644 index 00000000..6fee219c --- /dev/null +++ b/source/docs/data/rccl_trace.csv @@ -0,0 +1,22 @@ +"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp" +"RCCL_API","ncclGetVersion",1834151,1834151,416,18413845573432,18413845577374 +"RCCL_API","ncclGetUniqueId",1834151,1834151,1116,18413961300878,18413963267869 +"RCCL_API","ncclGetUniqueId",1834151,1834151,1481,18414166449182,18414166720831 +"RCCL_API","ncclGroupStart",1834151,1834151,1482,18414166723772,18414166726834 +"RCCL_API","ncclGroupEnd",1834151,1834151,1490,18414166823575,18414380520973 
+"RCCL_API","ncclCommInitAll",1834151,1834151,1477,18414166402665,18414380522536 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89098,18414380660695,18414380661652 +"RCCL_API","ncclAllReduce",1834151,1834151,89097,18414380653860,18414380693574 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89108,18414380694631,18414380694659 +"RCCL_API","ncclAllReduce",1834151,1834151,89107,18414380694212,18414380704722 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89117,18414380706650,18414380706677 +"RCCL_API","ncclAllReduce",1834151,1834151,89116,18414380705574,18414380715055 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89126,18414380715749,18414380715774 +"RCCL_API","ncclAllReduce",1834151,1834151,89125,18414380715463,18414380723944 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89135,18414380724688,18414380724715 +"RCCL_API","ncclAllReduce",1834151,1834151,89134,18414380724395,18414380732209 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89154,18414380746383,18414380746411 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89157,18414380749863,18414380749889 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89160,18414380751671,18414380751696 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89163,18414380753326,18414380753353 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89166,18414380755128,18414380755154 diff --git a/source/docs/data/rocprofv3_hip_memcpy_summary.png b/source/docs/data/rocprofv3_hip_memcpy_summary.png new file mode 100644 index 00000000..c7b048a7 Binary files /dev/null and b/source/docs/data/rocprofv3_hip_memcpy_summary.png differ diff --git a/source/docs/data/rocprofv3_memcpy_summary.png b/source/docs/data/rocprofv3_memcpy_summary.png new file mode 100644 index 00000000..22f1f513 Binary files /dev/null and b/source/docs/data/rocprofv3_memcpy_summary.png differ diff --git a/source/docs/data/rocprofv3_summary.png b/source/docs/data/rocprofv3_summary.png new file mode 100644 index 00000000..7ea55c5b Binary files 
/dev/null and b/source/docs/data/rocprofv3_summary.png differ diff --git a/source/docs/how-to/using-rocprofv3.rst b/source/docs/how-to/using-rocprofv3.rst index 97b6ce40..17502244 100644 --- a/source/docs/how-to/using-rocprofv3.rst +++ b/source/docs/how-to/using-rocprofv3.rst @@ -169,7 +169,7 @@ To use ``rocprofv3`` for application tracing, run: .. code-block:: bash - rocprofv3 -- + rocprofv3 -- HIP trace +++++++++++ @@ -180,7 +180,7 @@ To trace HIP runtime APIs, use: .. code-block:: bash - rocprofv3 --hip-trace -- < app_relative_path > + rocprofv3 --hip-trace -- The above command generates a ``hip_api_trace.csv`` file prefixed with the process ID. @@ -199,7 +199,7 @@ To trace HIP compile time APIs, use: .. code-block:: shell - rocprofv3 --hip-compiler-trace -- < app_relative_path > + rocprofv3 --hip-compiler-trace -- The above command generates a ``hip_api_trace.csv`` file prefixed with the process ID. @@ -225,7 +225,7 @@ HSA trace contains the start and end time of HSA runtime API calls and their asy .. code-block:: bash - rocprofv3 --hsa-trace -- < app_relative_path > + rocprofv3 --hsa-trace -- The above command generates a ``hsa_api_trace.csv`` file prefixed with process ID. Note that the contents of this file have been truncated for demonstration purposes. @@ -256,6 +256,14 @@ Here is a list of useful APIs for code instrumentation. - ``roctxRangePush``: Starts a new nested range. - ``roctxRangePop``: Stops the current nested range. - ``roctxRangeStop``: Stops the given range. +- ``roctxProfilerPause``: Requests that any currently running profiling tool stop collecting data. +- ``roctxProfilerResume``: Requests that any currently running profiling tool resume collecting data. +- ``roctxGetThreadId``: Retrieves an ID value for the current thread that is identical to the ID value a profiling tool gets via ``rocprofiler_get_thread_id(rocprofiler_thread_id_t*)``. 
+- ``roctxNameOsThread``: Labels the current CPU OS thread with the provided name in the output of the profiling tool. +- ``roctxNameHsaAgent``: Labels the given HSA agent with the provided name in the output of the profiling tool. +- ``roctxNameHipDevice``: Labels the given HIP device ID with the provided name in the output of the profiling tool. +- ``roctxNameHipStream``: Labels the given HIP stream with the provided name in the output of the profiling tool. + .. note:: To use ``rocprofv3`` for marker tracing, including and linking to old ROCTx works but it is recommended to switch to new ROCTx because @@ -291,7 +299,7 @@ To trace the API calls enclosed within the range, use: .. code-block:: bash - rocprofv3 --marker-trace -- < app_relative_path > + rocprofv3 --marker-trace -- Running the preceding command generates a ``marker_api_trace.csv`` file prefixed with the process ID. @@ -308,6 +316,127 @@ Here are the contents of ``marker_api_trace.csv`` file: For the description of the fields in the output file, see :ref:`output-file-fields`. +``roctxProfilerPause`` and ``roctxProfilerResume`` can be used to hide the calls made between them from the profiling tool. This is useful when some calls are not relevant to your profiling session. + +.. 
code-block:: cpp + + #include + + // Memory transfer from host to device + HIP_API_CALL(hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice)); + + auto tid = roctx_thread_id_t{}; + roctxGetThreadId(&tid); + roctxProfilerPause(tid); + // Memory transfer that should be hidden by profiling tool + HIP_API_CALL( + hipMemcpy(gpuTransposeMatrix, gpuMatrix, NUM * sizeof(float), hipMemcpyDeviceToDevice)); + roctxProfilerResume(tid); + + // Launching kernel from host + hipLaunchKernelGGL(matrixTranspose, + dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y), + dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), + 0, + 0, + gpuTransposeMatrix, + gpuMatrix, + WIDTH); + + // Memory transfer from device to host + HIP_API_CALL( + hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost)); + +.. code-block:: shell + + rocprofv3 --marker-trace --hip-trace -- + + The above command generates a ``hip_api_trace.csv`` file prefixed with the process ID. The file contains only two ``hipMemcpy`` calls; the ``hipMemcpyDeviceToDevice`` copy issued between the pause and resume calls is hidden. + +.. 
code-block:: shell + + "Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp" + "HIP_COMPILER_API","__hipRegisterFatBinary",1643920,1643920,1,320301257609216,320301257636427 + "HIP_COMPILER_API","__hipRegisterFunction",1643920,1643920,2,320301257650707,320301257678857 + "HIP_RUNTIME_API","hipGetDevicePropertiesR0600",1643920,1643920,4,320301258114239,320301337764472 + "HIP_RUNTIME_API","hipMalloc",1643920,1643920,5,320301338073823,320301338247374 + "HIP_RUNTIME_API","hipMalloc",1643920,1643920,6,320301338248284,320301338399595 + "HIP_RUNTIME_API","hipMemcpy",1643920,1643920,7,320301338410995,320301631549262 + "HIP_COMPILER_API","__hipPushCallConfiguration",1643920,1643920,10,320301632131175,320301632134215 + "HIP_COMPILER_API","__hipPopCallConfiguration",1643920,1643920,11,320301632137745,320301632139735 + "HIP_RUNTIME_API","hipLaunchKernel",1643920,1643920,12,320301632142615,320301632898289 + "HIP_RUNTIME_API","hipMemcpy",1643920,1643920,14,320301632901249,320301633934395 + "HIP_RUNTIME_API","hipFree",1643920,1643920,15,320301643320908,320301643511479 + "HIP_RUNTIME_API","hipFree",1643920,1643920,16,320301643512629,320301643585639 + +Kernel Rename +++++++++++++++ + +Use this option to rename kernels with the message of their enclosing ``roctxRangePush``/``roctxRangePop`` range. This option was known as ``--roctx-rename`` in earlier rocprof versions. + +The following code snippet shows how to use the ``--kernel-rename`` option: + +.. 
code-block:: cpp + + #include + + roctxRangePush("HIP_Kernel-1"); + + // Launching kernel from host + hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH); + + // Memory transfer from device to host + roctxRangePush("hipMemCpy-DeviceToHost"); + + hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost); + + roctxRangePop(); // for "hipMemCpy-DeviceToHost" + roctxRangePop(); // for "HIP_Kernel-1" + roctxRangeStop(rangeId); + +To rename the kernel, use: + +.. code-block:: bash + + rocprofv3 --marker-trace --kernel-rename -- + +The above command generates a ``marker_api_trace.csv`` file prefixed with the process ID. + +.. code-block:: shell + + $ cat 210_marker_api_trace.csv + "Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp" + "MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247 + "MARKER_CONTROL_API","roctxProfilerPause",315155,315155,3,58378844627184,58378844627502 + "MARKER_CONTROL_API","roctxProfilerResume",315155,315155,4,58378844638601,58378844639267 + "MARKER_CORE_API","pre-kernel-launch",315155,315155,5,58378844641787,58378844641787 + "MARKER_CORE_API","post-kernel-launch",315155,315155,6,58378844936586,58378844936586 + "MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270 + "MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485 + + +Kokkos Trace +++++++++++++++ + +rocprofv3 has built-in `Kokkos Tools library `_ support to trace Kokkos API calls. `Kokkos `_ is a C++ library for writing performance portable applications. It is used in many scientific applications to write performance portable code that can run on CPUs, GPUs, and other accelerators. +rocprofv3 loads a built-in Kokkos tools library that emits ROCTx ranges with the labels passed through the API. For example, 
``Kokkos::parallel_for("MyParallelForLabel", …);`` internally calls ``roctxRangePush``. rocprofv3 also enables the kernel renaming option so that the highly templated kernel names are replaced by the Kokkos labels. +To enable built-in marker support, use the ``--kokkos-trace`` option. Internally, this option enables ``--marker-trace`` and ``--kernel-rename``: + +.. code-block:: bash + + rocprofv3 --kokkos-trace -- + +The above command generates a ``marker_api_trace.csv`` file prefixed with the process ID. + +.. code-block:: shell + + $ cat 210_marker_api_trace.csv + "Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp" + "MARKER_CORE_API","Kokkos::Initialization Complete",4069256,4069256,1,56728499773965,56728499773965 + "MARKER_CORE_API","Kokkos::Impl::CombinedFunctorReducer, CountFunctor, long int>::Reducer, void>",4069256,4069256,2,56728501756088,56728501764241 + "MARKER_CORE_API","Kokkos::parallel_reduce: fence due to result being value, not view",4069256,4069256,4,56728501767957,56728501769600 + "MARKER_CORE_API","Kokkos::Finalization Complete",4069256,4069256,6,56728502054554,56728502054554 + Kernel trace ++++++++++++++ @@ -315,7 +444,7 @@ To trace kernel dispatch traces, use: .. code-block:: shell - rocprofv3 --kernel-trace -- < app_relative_path > + rocprofv3 --kernel-trace -- The above command generates a ``kernel_trace.csv`` file prefixed with the process ID. @@ -339,7 +468,7 @@ To trace memory moves across the application, use: .. code-block:: shell - rocprofv3 –-memory-copy-trace -- < app_relative_path > + rocprofv3 --memory-copy-trace -- The above command generates a ``memory_copy_trace.csv`` file prefixed with the process ID. @@ -372,7 +501,7 @@ memory operations (copies and scratch). .. 
code-block:: shell - rocprofv3 –-runtime-trace -- < app_relative_path > + rocprofv3 --runtime-trace -- Running the above command generates ``hip_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``scratch_memory_trace.csv``,and ``marker_api_trace.csv`` (if ``ROCTx`` APIs are specified in the application) files prefixed with the process ID. @@ -383,7 +512,7 @@ This is an all-inclusive option to collect all the above-mentioned traces. .. code-block:: shell - rocprofv3 –-sys-trace -- < app_relative_path > + rocprofv3 --sys-trace -- Running the above command generates ``hip_api_trace.csv``, ``hsa_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, and ``marker_api_trace.csv`` (if ``ROCTx`` APIs are specified in the application) files prefixed with the process ID. @@ -394,19 +523,45 @@ This option collects scratch memory operation's traces. Scratch is an address sp .. code-block:: shell - rocprofv3 --scratch-memory-trace -- < app_relative_path > + rocprofv3 --scratch-memory-trace -- + + +RCCL trace +++++++++++++ + +`RCCL `_ (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs. This option traces those communication routines. + +.. code-block:: shell + + rocprofv3 --rccl-trace -- -Stats -++++++++ +The above command generates a ``rccl_api_trace.csv`` file prefixed with the process ID. + +.. code-block:: shell + + $ cat 197_rccl_api_trace.csv + +Here are the contents of the ``rccl_api_trace.csv`` file: + +.. csv-table:: RCCL trace + :file: /data/rccl_trace.csv + :widths: 10,10,10,10,10,20,20 + :header-rows: 1 + +Post-processing tracing options ++++++++++++++++++++++++++++++++++ + +1. Stats ++++++++++ This option collects statistics for the enabled tracing types. For example, to collect statistics of HIP APIs, when HIP trace is enabled. A higher percentage in statistics can help user focus on the API/function that has taken the most time: .. 
code-block:: shell - rocprofv3 --stats --hip-trace -- < app_relative_path > + rocprofv3 --stats --hip-trace -- -The above command generates a ``hip_api_stats.csv`` and ``hip_api_trace`` file prefixed with the process ID. +The above command generates ``hip_api_stats.csv``, ``domain_stats.csv``, and ``hip_api_trace.csv`` files prefixed with the process ID. .. code-block:: shell @@ -419,8 +574,60 @@ Here are the contents of ``hip_api_stats.csv`` file: :widths: 10,10,20,20,10,10,10,10 :header-rows: 1 +Here are the contents of the ``domain_stats.csv`` file: + +.. csv-table:: Domain stats + :file: /data/hip_domain_stats.csv + :widths: 10,10,20,20,10,10,10,10 + :header-rows: 1 + For the description of the fields in the output file, see :ref:`output-file-fields`. +2. Summary ++++++++++++ + +Outputs a single summary of tracing data at the conclusion of the profiling session. + +.. code-block:: shell + + rocprofv3 -S --hip-trace -- + +.. image:: /data/rocprofv3_summary.png + + +2.1 Summary per domain +++++++++++++++++++++++ + +Outputs the summary of each tracing domain at the end of the profiling session. + +.. code-block:: shell + + rocprofv3 -D --hsa-trace --hip-trace -- + +The above command generates ``hip_api_trace.csv`` and ``hsa_api_trace.csv`` files prefixed with the process ID, along with a summary of each domain printed to the terminal. + +2.2 Summary groups ++++++++++++++++++++ + +Users can create a summary of multiple domains by specifying the domain names in the command line. The summary groups are separated by a pipe (|) symbol. +To create a summary for ``MEMORY_COPY`` domains, use: + +.. code-block:: shell + + rocprofv3 --summary-groups MEMORY_COPY --sys-trace -- + +.. image:: /data/rocprofv3_memcpy_summary.png + + +To create a summary for ``MEMORY_COPY`` and ``HIP_API`` domains, use: + +.. code-block:: shell + + rocprofv3 --summary-groups 'MEMORY_COPY|HIP_API' --sys-trace -- + +.. 
image:: /data/rocprofv3_hip_memcpy_summary.png + + Kernel profiling ------------------- @@ -510,7 +717,7 @@ Properties { "jobs": [ { - "pmc": ["SQ_WAVES", "GRBM_COUNT", "GUI_ACTIVE"] + "pmc": ["SQ_WAVES", "GRBM_COUNT", "GRBM_GUI_ACTIVE"] }, { "pmc": ["FETCH_SIZE", "WRITE_SIZE"], @@ -534,7 +741,7 @@ Properties - pmc: - SQ_WAVES - GRBM_COUNT - - GUI_ACTIVE + - GRBM_GUI_ACTIVE - 'TCC_HIT[1]' - 'TCC_HIT[2]' - pmc: @@ -551,7 +758,7 @@ To supply the counters via ``command-line`` options, use: .. code-block:: shell - rocprofv3 --pmc SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE -- + rocprofv3 --pmc SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE -- .. note:: 1. Please note that more than 1 counters should be separated by a space or a comma. @@ -564,7 +771,7 @@ To supply the input file for kernel profiling, use: .. code-block:: shell - rocprofv3 -i input.txt -- + rocprofv3 -i input.txt -- Running the above command generates a ``./pmc_n/counter_collection.csv`` file prefixed with the process ID. For each ``pmc`` row, a directory ``pmc_n`` containing a ``counter_collection.csv`` file is generated, where n = 1 for the first row and so on. @@ -636,7 +843,7 @@ To collect counters for the kernels matching the filters specified in the preced .. 
code-block:: shell - rocprofv3 -i input.yml -- + rocprofv3 -i input.yml -- $ cat pass_1/312_counter_collection.csv "Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp" diff --git a/source/docs/rocprofiler-sdk.dox.in b/source/docs/rocprofiler-sdk.dox.in index 7188a97c..e8fdd760 100644 --- a/source/docs/rocprofiler-sdk.dox.in +++ b/source/docs/rocprofiler-sdk.dox.in @@ -139,7 +139,8 @@ FILE_PATTERNS = *.h \ *.tcc \ conf.py RECURSIVE = YES -EXCLUDE = +EXCLUDE = @SOURCE_DIR@/README.md \ + @SOURCE_DIR@/include/rocprofiler-sdk/rccl/details/rccl.h EXCLUDE_SYMLINKS = YES EXCLUDE_PATTERNS = */.git/* \ @SOURCE_DIR@/**/tests/* \ @@ -266,7 +267,7 @@ LATEX_CMD_NAME = latex MAKEINDEX_CMD_NAME = makeindex LATEX_MAKEINDEX_CMD = makeindex COMPACT_LATEX = NO -PAPER_TYPE = +PAPER_TYPE = a4 EXTRA_PACKAGES = float LATEX_HEADER = LATEX_FOOTER = @@ -332,7 +333,8 @@ MACRO_EXPANSION = YES EXPAND_ONLY_PREDEF = NO SEARCH_INCLUDES = NO INCLUDE_PATH = @SOURCE_DIR@/source/include -INCLUDE_FILE_PATTERNS = *.h +INCLUDE_FILE_PATTERNS = *.h \ + *.hpp PREDEFINED = "ROCPROFILER_API=" \ "ROCPROFILER_EXPORT=" \ "ROCPROFILER_IMPORT=" \ diff --git a/source/include/rocprofiler-sdk/rccl/details/rccl.h b/source/include/rocprofiler-sdk/rccl/details/rccl.h index 304f6025..e114a26d 100644 --- a/source/include/rocprofiler-sdk/rccl/details/rccl.h +++ b/source/include/rocprofiler-sdk/rccl/details/rccl.h @@ -47,7 +47,7 @@ typedef struct } ncclUniqueId; /*! @defgroup rccl_result_code Result Codes - @details The various result codes that RCCL API calls may return + @brief The various result codes that RCCL API calls may return @{ */ /*! @brief Result type @@ -71,12 +71,14 @@ typedef enum #define NCCL_SPLIT_NOCOLOR -1 /*! 
@defgroup rccl_config_type Communicator Configuration - @details Structure that allows for customizing Communicator behavior via + @brief Structure that allows for customizing Communicator behavior via ncclCommInitRankConfig @{ */ -/*! @brief Communicator configuration - @details Users can assign value to attributes to specify the behavior of a communicator */ +/** + * @defgroup Communicator configuration + * @brief Users can assign value to attributes to specify the behavior of a communicator. + */ typedef struct ncclConfig_v21700 { /* attributes that users should never touch. */ @@ -122,7 +124,7 @@ ncclResult_t pncclMemFree(void* ptr); /*! @defgroup rccl_api_version Version Information - @details API call that returns RCCL version + @brief API call that returns RCCL version @{ */ /*! @brief Return the RCCL_VERSION_CODE of RCCL in the supplied integer. @@ -140,7 +142,7 @@ pncclGetVersion(int* version); /*! @} */ /*! @defgroup rccl_api_communicator Communicator Initialization/Destruction - @details API calls that operate on communicators. + @brief API calls that operate on communicators. Communicators objects are used to launch collective communication operations. Unique ranks between 0 and N-1 must be assigned to each HIP device participating in the same Communicator. @@ -292,7 +294,7 @@ pncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t* newcomm, ncclCon /*! @} */ /*! @defgroup rccl_api_errcheck Error Checking Calls - @details API calls that check for errors + @brief API calls that check for errors @{ */ /*! @brief Returns a string for each result code. @@ -330,7 +332,7 @@ pncclCommGetAsyncError(ncclComm_t comm, ncclResult_t* asyncError); /*! @} */ /*! @defgroup rccl_api_comminfo Communicator Information - @details API calls that query communicator information + @brief API calls that query communicator information @{ */ /*! @brief Gets the number of ranks in the communicator clique. 
@@ -391,7 +393,7 @@ pncclCommDeregister(const ncclComm_t comm, void* handle); /*! @endcond */ /*! @defgroup rccl_api_enumerations API Enumerations - @details Enumerations used by collective communication calls + @brief Enumerations used by collective communication calls @{ */ /*! @brief Dummy reduction enumeration @@ -454,7 +456,7 @@ typedef enum /*! @} */ /*! @defgroup rccl_api_custom_redop Custom Reduction Operator - @details API calls relating to creation/destroying custom reduction operator + @brief API calls relating to creation/destroying custom reduction operator that pre-multiplies local source arrays prior to reduction @{ */ @@ -516,7 +518,7 @@ pncclRedOpDestroy(ncclRedOp_t op, ncclComm_t comm); /*! @} */ /*! @defgroup rccl_collective_api Collective Communication Operations - @details Collective communication operations must be called separately for each + @brief Collective communication operations must be called separately for each communicator in a communicator clique. They return when operations have been enqueued on the HIP stream. @@ -935,7 +937,7 @@ pncclAllToAllv(const void* sendbuff, /*! @} */ /*! @defgroup msccl_api MSCCL Algorithm - @details API calls relating to the optional MSCCL algorithm datapath + @brief API calls relating to the optional MSCCL algorithm datapath @{ */ /*! @brief Opaque handle to MSCCL algorithm */ @@ -1030,7 +1032,7 @@ pmscclUnloadAlgo(mscclAlgoHandle_t mscclAlgoHandle); /*! @} */ /*! @defgroup rccl_group_api Group semantics - @details When managing multiple GPUs from a single thread, and since RCCL collective + @brief When managing multiple GPUs from a single thread, and since RCCL collective calls may perform inter-CPU synchronization, we need to "group" calls for different ranks/devices into a single call. 
diff --git a/source/include/rocprofiler-sdk/registration.h b/source/include/rocprofiler-sdk/registration.h index 837367af..a1a8f813 100644 --- a/source/include/rocprofiler-sdk/registration.h +++ b/source/include/rocprofiler-sdk/registration.h @@ -30,7 +30,7 @@ ROCPROFILER_EXTERN_C_INIT /** * @defgroup REGISTRATION_GROUP Tool registration * - * Data types and functions for tool registration with rocprofiler + * @brief Data types and functions for tool registration with rocprofiler * @{ */ diff --git a/source/include/rocprofiler-sdk/rocprofiler.h b/source/include/rocprofiler-sdk/rocprofiler.h index 1681cef9..d2b5a632 100644 --- a/source/include/rocprofiler-sdk/rocprofiler.h +++ b/source/include/rocprofiler-sdk/rocprofiler.h @@ -86,7 +86,7 @@ ROCPROFILER_EXTERN_C_INIT /** * @defgroup MISCELLANEOUS_GROUP Miscellaneous Utility Functions - * + * @brief Utility functions for the library * @{ */
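Note on the trace samples added in this change: the trace CSV files (for example `rccl_trace.csv`) share the column layout `Domain, Function, ..., Start_Timestamp, End_Timestamp`, so per-function totals comparable in spirit to the `--stats` output can be derived with a few lines of post-processing. The sketch below is illustrative only and is not part of rocprofv3. The helper name `summarize_trace` is hypothetical, the sample rows are copied from the `rccl_trace.csv` data in this diff, and the timestamps are assumed to be nanoseconds, as the `...Ns` column names in `hip_domain_stats.csv` suggest.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the rccl_trace.csv sample added in this diff.
SAMPLE = '''"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"RCCL_API","ncclGetVersion",1834151,1834151,416,18413845573432,18413845577374
"RCCL_API","ncclAllReduce",1834151,1834151,89097,18414380653860,18414380693574
"RCCL_API","ncclAllReduce",1834151,1834151,89107,18414380694212,18414380704722
'''


def summarize_trace(csv_text):
    """Return {function: (calls, total_ns, average_ns)} for a trace CSV.

    Hypothetical helper: assumes the Function, Start_Timestamp, and
    End_Timestamp columns shown in the sample trace files.
    """
    durations = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        elapsed = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
        durations[row["Function"]].append(elapsed)
    return {
        name: (len(values), sum(values), sum(values) / len(values))
        for name, values in durations.items()
    }


summary = summarize_trace(SAMPLE)
for name, (calls, total_ns, avg_ns) in sorted(summary.items()):
    print(f"{name}: calls={calls} total_ns={total_ns} avg_ns={avg_ns:.1f}")
```

On the three sample rows this prints `ncclAllReduce: calls=2 total_ns=50224 avg_ns=25112.0` followed by `ncclGetVersion: calls=1 total_ns=3942 avg_ns=3942.0`. To apply it to a real run, read the generated trace file and pass its text to the helper.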