This issue tracks issues encountered while developing the avx2 CK prototype.
CPU-only compile. A lot of headers include `hip_runtime.h` and use the `__device__` / `__host__` attributes to mark host/device code. Device-related code should be decoupled to allow a CPU-only build (see the sketch below).
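One possible way to decouple is to define the HIP attribute macros away in a CPU-only build; a minimal sketch, assuming a hypothetical `CK_CPU_ONLY` build flag (not an existing CK macro):

```cpp
// Sketch only: in a CPU-only build, define the HIP attributes as empty macros
// instead of pulling in hip_runtime.h. CK_CPU_ONLY is an assumed build flag.
#ifdef CK_CPU_ONLY
#define __host__
#define __device__
#else
#include <hip/hip_runtime.h>
#endif
```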
`DynamicBuffer` contains GPU intrinsics for its memory operations; for CPU we may need to use AVX-related intrinsics instead (see the sketch below).
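For example, the fp32x8 load/store could map onto AVX intrinsics on the CPU side; a minimal sketch (the helper names are illustrative, not existing CK APIs):

```cpp
#include <immintrin.h>

// Illustrative CPU-side vector load/store that could back a buffer abstraction,
// using AVX intrinsics instead of GPU buffer intrinsics.
static inline __m256 cpu_buffer_load_fp32x8(const float* p, long offset)
{
    return _mm256_loadu_ps(p + offset); // unaligned 8 x float load
}

static inline void cpu_buffer_store_fp32x8(__m256 v, float* p, long offset)
{
    _mm256_storeu_ps(p + offset, v); // unaligned 8 x float store
}
```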
`__attribute__((ext_vector_type(N)))` does not seem to map cleanly onto the 64/128/256-bit registers. `ext_vector_type(8)` generates 2 `xmm` registers, `ext_vector_type(4)` generates a single `xmm` (this is what we want), but `ext_vector_type(2)` still generates a single `xmm`. This makes it difficult to implement the vector type on CPU.
Also, a `ymm`/`xmm` value cannot be iterated over element by element (the inner 8 floats / 4 floats) to apply an element-wise operation; the register must be treated as a whole. Hence `StaticallyIndexedArray` cannot be utilized.
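For reference, the clang extension in question looks like this (a sketch; actual register allocation depends on target flags):

```cpp
// clang ext_vector_type: N x float vector types.
typedef float float8_t __attribute__((ext_vector_type(8))); // 256-bit, ideally one ymm
typedef float float4_t __attribute__((ext_vector_type(4))); // 128-bit, one xmm
typedef float float2_t __attribute__((ext_vector_type(2))); // 64-bit, still placed in an xmm

// v[i] subscripting is supported by the extension, but the codegen behaviour
// described above makes it awkward to treat the vector as a per-element
// container (hence StaticallyIndexedArray does not fit).
```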
Registers available for frontend programming are limited, so we prefer not to implement the thread-local buffer by holding data in registers. Every micro kernel will therefore need to write its result out to memory (cache) before the next iteration.
For task distribution, we design the following multi-level gemm (a loop-structure sketch follows the list):
a). thread-wise gemm: the micro kernel; the A/B matrices are expected to live in L1 cache.
b). block-wise gemm: the A/B matrices are expected to live in L2/L3 cache, i.e. the cache block. Despite the name, this still runs on a single thread.
c). grid-wise gemm: the whole task size; multi-threading is applied at this level.
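Roughly, the loop structure could look like the sketch below; the tile sizes and function names are placeholders, not the actual CK interfaces:

```cpp
// Illustrative three-level loop nest; MBlock/NBlock/KBlock, MTile/NTile and the
// function names are placeholders, not actual CK symbols.
void threadwise_gemm(const float* A, const float* B, float* C,
                     int m, int n, int k, int M, int N, int K)
{
    // micro kernel: A/B micro panels expected to sit in L1
}

void blockwise_gemm(const float* A, const float* B, float* C,
                    int mb, int nb, int kb, int MBlock, int NBlock,
                    int M, int N, int K)
{
    const int MTile = 6, NTile = 16; // e.g. 6x16 fp32 micro tile with AVX2
    // still single-threaded; A/B cache blocks expected to sit in L2/L3
    for(int m = mb; m < mb + MBlock && m < M; m += MTile)
        for(int n = nb; n < nb + NBlock && n < N; n += NTile)
            threadwise_gemm(A, B, C, m, n, kb, M, N, K);
}

void gridwise_gemm(const float* A, const float* B, float* C, int M, int N, int K)
{
    const int MBlock = 256, NBlock = 256, KBlock = 128; // cache-block sizes
    #pragma omp parallel for collapse(2) // multi-threading only at this level
    for(int mb = 0; mb < M; mb += MBlock)
        for(int nb = 0; nb < N; nb += NBlock)
            for(int kb = 0; kb < K; kb += KBlock)
                blockwise_gemm(A, B, C, mb, nb, kb, MBlock, NBlock, M, N, K);
}
```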
NUMA binding, thread binding
In a multi-threaded environment, how threads are bound to cores makes a big performance difference, especially on the current Zen chiplet design (see the sketch below).
TODO: Zen optimization guide
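A minimal sketch of explicit thread binding on Linux (assuming OpenMP; in practice `OMP_PROC_BIND`/`OMP_PLACES`, numactl, or hwloc give more control over the CCX/CCD and NUMA layout):

```cpp
#include <omp.h>
#include <pthread.h>
#include <sched.h>

// Bind each OpenMP thread to one core with a naive 1:1 mapping; a real policy
// should follow the CCX/CCD topology and NUMA nodes of the Zen chiplet layout.
void bind_threads_one_per_core()
{
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set); // thread i -> logical core i
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }
}
```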
Tile blocking needs to support blocks that do not evenly divide the problem size (see the sketch after this list).
a). gridwise/blockwise/threadwise gemm need to calculate the current block size at runtime instead of at compile time.
b). threadwise gemm dispatches to different kernels (e.g. depending on the tail size).
c). threadwise transfer needs to handle unevenly divided sizes and packing (or no packing).
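For (a), the runtime tail-size calculation is straightforward; an illustrative helper (not an existing CK API):

```cpp
#include <algorithm>

// The last block along a dimension may be smaller than the nominal
// compile-time block size, so compute the current size at runtime.
inline int current_block_size(int dim_size, int block_start, int nominal_block)
{
    return std::min(nominal_block, dim_size - block_start);
}
// e.g. M = 1000, MBlock = 256 -> blocks of size 256, 256, 256, 232
```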
Transpose during read/write using AVX registers (how can this be described by a tensor transform?); see the sketch below.
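For reference, an in-register 4x4 float transpose looks like the sketch below (the 8x8 `ymm` version needs unpack/shuffle/permute2f128 and is longer); the open question is how to express this as a tensor transform:

```cpp
#include <immintrin.h>

// 4x4 float transpose kept in xmm registers via the SSE helper macro.
static inline void transpose_4x4(const float* src, int src_stride,
                                 float* dst, int dst_stride)
{
    __m128 r0 = _mm_loadu_ps(src + 0 * src_stride);
    __m128 r1 = _mm_loadu_ps(src + 1 * src_stride);
    __m128 r2 = _mm_loadu_ps(src + 2 * src_stride);
    __m128 r3 = _mm_loadu_ps(src + 3 * src_stride);

    _MM_TRANSPOSE4_PS(r0, r1, r2, r3); // in-register transpose

    _mm_storeu_ps(dst + 0 * dst_stride, r0);
    _mm_storeu_ps(dst + 1 * dst_stride, r1);
    _mm_storeu_ps(dst + 2 * dst_stride, r2);
    _mm_storeu_ps(dst + 3 * dst_stride, r3);
}
```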
DimAccessOrder with OpenMP
e.g. the access order is <0, 1, 2>, and we need to merge dim 1 and dim 2 to utilize OpenMP for multi-thread distribution (see the sketch below).
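A sketch of what the merged loop could look like (plain loops for illustration, not CK's actual transform code):

```cpp
// Access order <0, 1, 2>: keep dim 0 outer, merge dims 1 and 2 into one
// flattened loop so OpenMP has a large iteration space to distribute.
void copy_3d_merge_dim12(const float* src, float* dst, int d0, int d1, int d2)
{
    for(int i0 = 0; i0 < d0; ++i0)
    {
        #pragma omp parallel for
        for(int i12 = 0; i12 < d1 * d2; ++i12)
        {
            int i1 = i12 / d2; // un-merge back to (i1, i2)
            int i2 = i12 % d2;
            long idx = ((long)i0 * d1 + i1) * d2 + i2;
            dst[idx] = src[idx];
        }
    }
}
```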
Dynamic threadwise copy
Today the GPU path uses static_ford over the copy dimensions, but on CPU the number of iterations would be thousands or tens of thousands, which is too many to expand statically in code (see the sketch below).
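A dynamic (runtime-loop) counterpart to static_ford could look like the sketch below; `dynamic_ford` is an illustrative name, not an existing CK utility:

```cpp
#include <array>

// Runtime multi-dimensional iteration: loop counts are ordinary runtime values,
// so tens of thousands of iterations do not expand into static code.
template <int NDim, typename F>
void dynamic_ford(const std::array<long, NDim>& lengths, F&& f)
{
    std::array<long, NDim> idx{}; // all zeros
    for(;;)
    {
        f(idx); // visit one multi-index

        int d = NDim - 1;
        while(d >= 0 && ++idx[d] == lengths[d]) // odometer-style carry
        {
            idx[d] = 0;
            --d;
        }
        if(d < 0)
            break;
    }
}

// usage: dynamic_ford<2>({H, W}, [&](const std::array<long, 2>& i) { /* copy one element */ });
```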