.gitignore
CMakeLists.txt
Makefile
README.md
inline_assembly_vs2017.sln
inline_assembly_vs2017.vcxproj
inline_assembly_vs2017.vcxproj.filters
inline_assembly_vs2019.sln
inline_assembly_vs2019.vcxproj
inline_assembly_vs2019.vcxproj.filters
inline_assembly_vs2022.sln
inline_assembly_vs2022.vcxproj
inline_assembly_vs2022.vcxproj.filters
main.hip

HIP-Basic Inline Assembly Example

Description

This program showcases an implementation of a simple matrix transpose kernel, which uses inline assembly and works on both AMD and NVIDIA hardware.

By using inline assembly in your kernels, you may be able to gain extra performance. It could also enable you to use special GPU hardware features which are not available through compiler intrinsics.

For more insights, read the following blog posts by Ben Sander: "The Art of AMDGCN Assembly: How to Bend the Machine to Your Will" and "AMD GCN Assembly: Cross-Lane Operations".

For more information, see the AMD ISA documentation for current architectures and the User Guide for the LLVM AMDGPU Back-end.

Application flow

1. A number of variables are defined to control the problem details and the kernel launch parameters.
2. The input matrix is set up in host memory.
3. The necessary amount of device memory is allocated and the input is copied to the device.
4. The GPU transposition kernel is launched with the previously defined arguments.
5. The kernel uses different inline assembly for its data movement, depending on the target platform.
6. The transposed matrix is copied back to the host and all device memory is freed.
7. The elements of the result matrix are compared with the expected result, and the outcome of the comparison is printed to the standard output.
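The host-only parts of this flow (setting up the input and checking the result) can be sketched in plain C++, which is also valid HIP host code. This is a minimal sketch under assumptions, not the code in `main.hip`: the helper names `make_input`, `transpose_reference`, and `count_mismatches` are hypothetical, and the row-major layout and index-valued test data are chosen only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds a rows x cols row-major matrix in host memory
// whose elements equal their linear index, so every value is unique.
std::vector<float> make_input(std::size_t rows, std::size_t cols)
{
    std::vector<float> matrix(rows * cols);
    for(std::size_t i = 0; i < matrix.size(); ++i)
    {
        matrix[i] = static_cast<float>(i);
    }
    return matrix;
}

// Hypothetical helper: CPU reference transpose. The input is rows x cols,
// the output is cols x rows, both stored row-major.
std::vector<float> transpose_reference(const std::vector<float>& in,
                                       std::size_t               rows,
                                       std::size_t               cols)
{
    std::vector<float> out(in.size());
    for(std::size_t r = 0; r < rows; ++r)
    {
        for(std::size_t c = 0; c < cols; ++c)
        {
            out[c * rows + r] = in[r * cols + c];
        }
    }
    return out;
}

// Hypothetical helper: counts the elements that differ between the result
// copied back from the device and the expected result. Exact comparison is
// acceptable here because a transpose only moves values and never rounds.
std::size_t count_mismatches(const std::vector<float>& result,
                             const std::vector<float>& expected)
{
    const std::size_t common = std::min(result.size(), expected.size());
    // Any size difference is counted as that many mismatched elements.
    std::size_t errors = std::max(result.size(), expected.size()) - common;
    for(std::size_t i = 0; i < common; ++i)
    {
        if(result[i] != expected[i])
        {
            ++errors;
        }
    }
    return errors;
}
```

In a full program, the vector copied back from the device would be passed to `count_mismatches` together with the output of `transpose_reference`, and a zero return value would be reported as a successful validation.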

Key APIs and Concepts

Using inline assembly in GPU kernels is similar to using inline assembly in host-side code. The volatile qualifier tells the compiler not to remove the assembly statement during optimization.

asm volatile("v_mov_b32_e32 %0, %1" : "=v"(variable_0) : "v"(variable_1))

However, since the instruction set differs between GPU architectures, you usually need to guard each assembly statement with the appropriate GPU architecture compiler defines in order to support multiple architectures (see the gpu_arch example for more fine-grained architecture control).
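The platform dispatch described above might look like the following sketch. It is an illustration under assumptions rather than the kernel in `main.hip`: the helper `move_u32` and the macro `SKETCH_DEVICE` are hypothetical names, the NVIDIA branch (the PTX `mov.u32` instruction with the `"r"` register constraint) is not taken from this README, and the plain C++ fallback is added only so the snippet also builds without a GPU toolchain. The macros `__HIP_PLATFORM_AMD__`, `__HIP_PLATFORM_NVIDIA__`, and `__HIP_DEVICE_COMPILE__` are the ones HIP provides for platform and device-pass detection.

```cpp
#include <cstdint>

// Under a HIP or CUDA compiler the helper is callable from kernels;
// under a plain C++ compiler the qualifier expands to nothing.
#if defined(__HIPCC__) || defined(__CUDACC__)
    #include <hip/hip_runtime.h>
    #define SKETCH_DEVICE __host__ __device__
#else
    #define SKETCH_DEVICE
#endif

// Hypothetical helper: copies a 32-bit value through a single register move.
SKETCH_DEVICE inline std::uint32_t move_u32(std::uint32_t in)
{
    std::uint32_t out;
#if defined(__HIP_DEVICE_COMPILE__) && defined(__HIP_PLATFORM_AMD__)
    // AMD device pass: vector register move, "v" requests a VGPR operand.
    asm volatile("v_mov_b32_e32 %0, %1" : "=v"(out) : "v"(in));
#elif defined(__HIP_DEVICE_COMPILE__) && defined(__HIP_PLATFORM_NVIDIA__)
    // NVIDIA device pass (assumed PTX equivalent): "r" requests a 32-bit register.
    asm volatile("mov.u32 %0, %1;" : "=r"(out) : "r"(in));
#else
    // Host pass or non-GPU compiler: plain assignment as a portable fallback.
    out = in;
#endif
    return out;
}
```

Keeping the assembly inside one small helper confines the architecture-specific code to a single place, so a kernel can call `move_u32` without repeating the preprocessor guards.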
