Matrix edges

Each workgroup is assigned a macro tile of shape macro-A x macro-B in C to work on, where C is m x n. We need to consider the case where m % macro-A is non-zero or n % macro-B is non-zero. One approach is to pad the memory of the edge macro tiles of C with zeros, and then perform GEMM as if the matrix were of dimensions (macro-A * ceil(m/macro-A)) x (macro-B * ceil(n/macro-B)), that is, with each dimension rounded up to a whole number of macro tiles. This is the approach used by Matsumoto et al. (2012), by Garg and Hendren (2014), and in certain cases by Nugteren and Codreanu (2015) in CLBlast. The drawback of this approach is that it requires additional GPU memory and extra data copying in global memory.
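The padding idea can be sketched on the host side in plain Python. This is only an illustration and not the code of any of the libraries cited above: the helper names `padded_dims` and `pad_with_zeros` are hypothetical, and the sketch assumes each dimension is rounded up to the next multiple of the macro tile size.

```python
def padded_dims(m, n, macro_a, macro_b):
    """Round m and n up to the next multiples of macro_a and macro_b."""
    tiles_a = -(-m // macro_a)  # integer ceil(m / macro_a)
    tiles_b = -(-n // macro_b)  # integer ceil(n / macro_b)
    return tiles_a * macro_a, tiles_b * macro_b


def pad_with_zeros(c, macro_a, macro_b):
    """Return a zero-padded copy of c (a list of equal-length rows)."""
    m, n = len(c), len(c[0])
    pm, pn = padded_dims(m, n, macro_a, macro_b)
    padded = [[0] * pn for _ in range(pm)]  # the extra memory the padding costs
    for i in range(m):
        for j in range(n):
            padded[i][j] = c[i][j]  # the extra copy the padding costs
    return padded
```

For a 5 x 3 matrix with 4 x 4 macro tiles, the padded matrix is 8 x 4, so 32 elements are stored for 15 that carry data. This makes the memory and copying overhead mentioned above concrete.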

A second approach involves branching between the workitems (threads) in the edge workgroups. The idea here is to have each workitem check that the element of C it is processing really is an element of C and does not lie beyond the edge of the matrix. This approach is not optimal, however, as branching between threads hurts parallelism and can result in poor performance.
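The bounds check can be emulated sequentially in Python to show which workitems do no useful work. This is a sketch under assumptions, not a GPU kernel: the function name `guarded_gemm` is hypothetical, each loop iteration over `(i, j)` stands in for one workitem computing one element of C, and the matrices are plain lists of rows.

```python
def guarded_gemm(a, b, m, n, k, macro_a, macro_b):
    """Compute C = A * B tile by tile, guarding every element of C.

    Returns C and the number of workitem slots that fell outside C.
    """
    c = [[0] * n for _ in range(m)]
    tiles_a = -(-m // macro_a)  # integer ceil(m / macro_a)
    tiles_b = -(-n // macro_b)  # integer ceil(n / macro_b)
    skipped = 0
    for ta in range(tiles_a):          # one workgroup per macro tile
        for tb in range(tiles_b):
            for i in range(macro_a):   # one emulated workitem per element
                for j in range(macro_b):
                    row = ta * macro_a + i
                    col = tb * macro_b + j
                    if row >= m or col >= n:
                        skipped += 1   # this slot lies beyond the matrix
                        continue
                    c[row][col] = sum(a[row][p] * b[p][col] for p in range(k))
    return c, skipped
```

With m = 3, n = 2 and 2 x 2 macro tiles, two of the eight slots fail the check. On a GPU the corresponding threads would take a different branch from their neighbours, which is the divergence described above.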

We

GPUs work

This is true for internal

Recall that a workgroup is made up of MAC workitems, or threads.
