Matrix edges
Each workgroup is assigned a macro tile of shape macro-A x macro-B in C to work on, where C is m x n. We need to consider the case where m % macro-A is non-zero or n % macro-B is non-zero. One approach is to pad the memory of edge macro tiles of C with zeros, and then perform GEMM as if the matrix were of dimensions (macro-A * ceil(m/macro-A)) x (macro-B * ceil(n/macro-B)), that is, with each dimension rounded up to a whole number of macro tiles. This is the approach used by Matsumoto et al. (2012), by Garg and Hendren (2014), and in certain cases by Nugteren and Codreanu (2015) in CLBlast. The drawbacks of this approach are the additional GPU memory it requires and the extra data copying in global memory.
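The padding approach can be sketched in a few lines of host-side Python. This is only an illustration of the idea, not code from any of the libraries cited: the function names padded_dims and pad_with_zeros are hypothetical, the matrix is assumed to be a row-major list of rows, and the padded size is assumed to be m and n rounded up to the next multiples of macro-A and macro-B.

```python
def padded_dims(m, n, macro_a, macro_b):
    """Round m and n up to the next multiples of the macro tile dimensions."""
    tiles_a = -(-m // macro_a)  # ceil(m / macro_a) using integer arithmetic
    tiles_b = -(-n // macro_b)  # ceil(n / macro_b)
    return tiles_a * macro_a, tiles_b * macro_b


def pad_with_zeros(c, macro_a, macro_b):
    """Return a zero-padded copy of c (hypothetical helper, list of rows)."""
    m, n = len(c), len(c[0])
    m_pad, n_pad = padded_dims(m, n, macro_a, macro_b)
    # Allocate the larger buffer; this is the extra memory the text mentions.
    padded = [[0.0] * n_pad for _ in range(m_pad)]
    # Copy the original rows into the top-left corner; this is the extra copy.
    for i in range(m):
        padded[i][:n] = c[i]
    return padded
```

For a 5 x 7 matrix with 4 x 4 macro tiles, padded_dims(5, 7, 4, 4) gives an 8 x 8 buffer, so 29 of the 64 stored elements are padding.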
A second approach involves branching between the workitems (threads) in the edge workgroups. The idea here is to have each workitem always check that the element it is processing is indeed an element of C and does not lie beyond the matrix. This approach is not optimal, however, as branching between threads hurts parallelism and can result in poor performance.
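The bounds check described above can be simulated on the host as follows. This is a simplified sketch under stated assumptions: the function name edge_tile_writes and the indices group_a and group_b (the position of the workgroup's macro tile in the grid of tiles) are hypothetical, and the loop visits every element of the macro tile one at a time rather than modelling how elements are distributed among workitems.

```python
def edge_tile_writes(m, n, macro_a, macro_b, group_a, group_b):
    """List the (row, col) elements of C that one workgroup may write."""
    writes = []
    for local_a in range(macro_a):
        for local_b in range(macro_b):
            # Global coordinates of this element within C.
            row = group_a * macro_a + local_a
            col = group_b * macro_b + local_b
            # The branch: skip elements that lie beyond the matrix.
            if row < m and col < n:
                writes.append((row, col))
    return writes
```

With m = 5, n = 7 and 4 x 4 macro tiles, the interior tile (0, 0) passes the check for all 16 elements, while the corner tile (1, 1) passes it for only 3 of them, so threads in that workgroup take different branches.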
We
GPUs work
This is true for internal
Recall that a workgroup is made up of MAC workitems, or threads.