A stream encapsulates a queue of tasks that are launched on the GPU device. This example showcases the usage of multiple streams, each with its own tasks. These tasks include asynchronous memory copies using `hipMemcpyAsync` and asynchronous kernel launches using `myKernelName<<<...>>>`.
- Host side input and output memory is allocated using `hipHostMalloc` as pinned memory. This ensures that the memory copies are performed asynchronously when using `hipMemcpyAsync`.
- Host input is instantiated.
- Device side storage is allocated using `hipMalloc`.
- Two `hipStream_t` streams are created using `hipStreamCreate`. The example demonstrates launching two different kernels, therefore each stream queues the tasks related to one kernel launch.
- Data is copied from host to device using `hipMemcpyAsync`.
- The two kernels, `matrix_transpose_static_shared` and `matrix_transpose_dynamic_shared`, are launched asynchronously on the two streams, respectively.
- An asynchronous memory copy task (using `hipMemcpyAsync`) is queued on each stream to transfer the results from device to host.
- The streams are destroyed using `hipStreamDestroy`.
- The host explicitly waits for all tasks to finish using `hipDeviceSynchronize`.
- Device side memory is freed using `hipFree`.
- Host side pinned memory is freed using `hipHostFree`.
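The steps above can be sketched as follows. This is a minimal illustration, not the example's actual source: the kernel bodies, the matrix width, and the launch dimensions are simplified assumptions, and error checking on the HIP calls is omitted for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Simplified stand-ins for the example's two kernels.
__global__ void matrix_transpose_static_shared(float* out, const float* in, int width)
{
    __shared__ float tile[32][32]; // statically sized shared memory
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();
    out[x * width + y] = tile[threadIdx.y][threadIdx.x];
}

__global__ void matrix_transpose_dynamic_shared(float* out, const float* in, int width)
{
    extern __shared__ float tile[]; // size supplied at launch time
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    tile[threadIdx.y * blockDim.x + threadIdx.x] = in[y * width + x];
    __syncthreads();
    out[x * width + y] = tile[threadIdx.y * blockDim.x + threadIdx.x];
}

int main()
{
    constexpr int    width = 32;
    constexpr size_t bytes = width * width * sizeof(float);

    // Pinned host memory, so hipMemcpyAsync can run asynchronously.
    float *h_in, *h_out_a, *h_out_b;
    hipHostMalloc(&h_in, bytes);
    hipHostMalloc(&h_out_a, bytes);
    hipHostMalloc(&h_out_b, bytes);
    for (int i = 0; i < width * width; ++i) h_in[i] = static_cast<float>(i);

    // Device side storage, one input/output pair per stream.
    float *d_in_a, *d_in_b, *d_out_a, *d_out_b;
    hipMalloc(&d_in_a, bytes);
    hipMalloc(&d_in_b, bytes);
    hipMalloc(&d_out_a, bytes);
    hipMalloc(&d_out_b, bytes);

    // Two streams, one per kernel.
    hipStream_t streams[2];
    hipStreamCreate(&streams[0]);
    hipStreamCreate(&streams[1]);

    // Queue host-to-device copies, kernel launches, and device-to-host
    // copies; tasks on different streams may overlap.
    const dim3 block(width, width);
    hipMemcpyAsync(d_in_a, h_in, bytes, hipMemcpyHostToDevice, streams[0]);
    hipMemcpyAsync(d_in_b, h_in, bytes, hipMemcpyHostToDevice, streams[1]);
    matrix_transpose_static_shared<<<1, block, 0, streams[0]>>>(d_out_a, d_in_a, width);
    matrix_transpose_dynamic_shared<<<1, block, bytes, streams[1]>>>(d_out_b, d_in_b, width);
    hipMemcpyAsync(h_out_a, d_out_a, bytes, hipMemcpyDeviceToHost, streams[0]);
    hipMemcpyAsync(h_out_b, d_out_b, bytes, hipMemcpyDeviceToHost, streams[1]);

    // Release the streams, then wait for all queued tasks to finish.
    hipStreamDestroy(streams[0]);
    hipStreamDestroy(streams[1]);
    hipDeviceSynchronize();

    std::printf("h_out_a[1] = %f\n", h_out_a[1]); // transposed element

    // Cleanup.
    hipFree(d_in_a); hipFree(d_in_b); hipFree(d_out_a); hipFree(d_out_b);
    hipHostFree(h_in); hipHostFree(h_out_a); hipHostFree(h_out_b);
    return 0;
}
```

Note the fourth launch argument in the triple-chevron syntax: it selects the stream, and the third selects the dynamic shared memory size used by `matrix_transpose_dynamic_shared`.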
A HIP stream allows device tasks to be grouped and launched asynchronously and independently from other tasks, which can be used to hide latencies and increase task completion throughput. When the results of a task queued on a particular stream are needed, that stream can be synchronized explicitly without blocking work queued on other streams. Each HIP stream is tied to a particular device, which enables HIP streams to be used to schedule work across multiple devices simultaneously.
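As a sketch of both points, per-stream synchronization and per-device streams: this fragment assumes two streams `stream_a` and `stream_b` already created with `hipStreamCreate`, and uses `hipStreamSynchronize` and `hipSetDevice` from the HIP runtime, which this example itself does not call.

```cpp
// Block the host only until stream_a's queued tasks finish;
// tasks queued on stream_b may still be running afterwards.
hipStreamSynchronize(stream_a);

// A stream is bound to the device that is current when it is created,
// so selecting another device (if present) before creation schedules
// that stream's work there.
hipSetDevice(1);
hipStream_t stream_dev1;
hipStreamCreate(&stream_dev1);
```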
- `__shared__`
- `__syncthreads`
- `hipStream_t`
- `hipStreamCreate`
- `hipStreamDestroy`
- `hipMalloc`
- `hipHostMalloc`
- `hipMemcpyAsync`
- `hipDeviceSynchronize`
- `hipFree`
- `hipHostFree`