[avx2] design/issues with avx2 prototyping #180

Open
carlushuang opened this issue Apr 7, 2022 · 0 comments

Comments


carlushuang commented Apr 7, 2022

This issue tracks the problems encountered while developing AVX2 support for CK.

  1. CPU-only compilation. Many headers include hip_runtime.h and use the __device__ / __host__ qualifiers to mark host/device code. Device-related code should be decoupled so that a CPU-only build is possible (see the sketch below).
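One possible way to decouple the qualifiers; the CK_* macro names here are placeholders, not existing CK macros:

```cpp
// Hypothetical portability macros: expand to HIP qualifiers on GPU builds,
// and to nothing on a CPU-only build so hip_runtime.h is never required.
#ifdef CK_CPU_ONLY
#define CK_DEVICE
#define CK_HOST
#define CK_HOST_DEVICE
#else
#include <hip/hip_runtime.h>
#define CK_DEVICE __device__
#define CK_HOST __host__
#define CK_HOST_DEVICE __host__ __device__
#endif
```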

  2. DynamicBuffer contains GPU intrinsics for its memory operations. For CPU operation we may need to use AVX intrinsics instead (a sketch is below).
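A rough sketch of what an AVX-backed buffer access could look like; DynamicBufferCpu and its methods are illustrative, not existing CK code:

```cpp
#include <immintrin.h>

// Illustrative CPU-side buffer: 8-float loads/stores go through AVX intrinsics
// instead of the GPU buffer-addressing intrinsics used by DynamicBuffer.
struct DynamicBufferCpu
{
    float* p_data;

    __m256 Get(int offset) const { return _mm256_loadu_ps(p_data + offset); }

    void Set(int offset, __m256 v) { _mm256_storeu_ps(p_data + offset, v); }
};
```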

  3. __attribute__((ext_vector_type(N))) does not seem to map cleanly onto 64/128/256-bit registers. ext_vector_type(8) generates two xmm registers, ext_vector_type(4) generates a single xmm (which is what we want), but ext_vector_type(2) still generates a full xmm. This makes it difficult to implement the vector type on the CPU.
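For reference, the types in question look roughly like this (float elements assumed; the register allocation noted above is what we observed, not a guarantee):

```cpp
typedef float float8_t __attribute__((ext_vector_type(8))); // 256-bit: observed to be split into two xmm
typedef float float4_t __attribute__((ext_vector_type(4))); // 128-bit: a single xmm (desired)
typedef float float2_t __attribute__((ext_vector_type(2))); //  64-bit: still occupies a whole xmm
```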

  4. Also, the 8 floats / 4 floats inside a ymm/xmm register cannot be iterated over one by one to apply an element-wise operation; the register must be treated as a whole. Hence StaticallyIndexedArray cannot be used (a possible workaround is sketched below).
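One possible workaround, assuming the element-wise op cannot be expressed with vector intrinsics directly, is to spill the register to a stack buffer, apply the op per lane, and reload:

```cpp
#include <immintrin.h>

// Spill a ymm to the stack, apply the element-wise op lane by lane, then reload.
// This is only a fallback sketch; it defeats the purpose of keeping data in registers.
template <typename ElementwiseOp>
__m256 apply_elementwise(__m256 v, ElementwiseOp op)
{
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, v);
    for(int i = 0; i < 8; ++i)
        tmp[i] = op(tmp[i]);
    return _mm256_load_ps(tmp);
}
```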

  5. Registers available for frontend programming are limited, so we prefer not to implement the thread-local buffer by holding data in registers. Every micro kernel therefore has to write its result to memory (cache) before moving on to the next iteration.

  6. For task distribution we design the following multi-level GEMM (a loop-nest sketch is below):
    a). thread-wise GEMM: the micro kernel; the A/B tiles are expected to stay in L1 cache.
    b). block-wise GEMM: the A/B tiles are expected to stay in L2/L3 cache, so we also call it a cache block. Despite the name, this still runs on a single thread.
    c). grid-wise GEMM: the whole problem size; multi-threading is applied at this level.
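A minimal sketch of the intended loop hierarchy; the tile sizes and the threadwise_gemm() micro kernel are placeholders, not CK code:

```cpp
constexpr int M_BLOCK = 256, N_BLOCK = 256, K_BLOCK = 128; // illustrative cache-block sizes

// micro kernel: computes one sub-tile, with its A/B data expected to stay in L1
void threadwise_gemm(/* block coordinates, pointers, strides, ... */) {}

void gridwise_gemm(int M, int N, int K)
{
    // grid-wise level: the whole problem; multi-threading is applied here
    #pragma omp parallel for collapse(2)
    for(int m_blk = 0; m_blk < M; m_blk += M_BLOCK)
        for(int n_blk = 0; n_blk < N; n_blk += N_BLOCK)
            // block-wise level: one cache block (L2/L3), still a single thread
            for(int k_blk = 0; k_blk < K; k_blk += K_BLOCK)
                threadwise_gemm(/* current (m_blk, n_blk, k_blk) block */);
}
```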

  7. NUMA binding, thread binding.
    In a multi-threaded environment, binding threads to specific cores makes a large performance difference, especially on current Zen chiplet designs (a pinning sketch is below).
    TODO: Zen optimization guide
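A minimal sketch of pinning the calling thread to one core with the Linux affinity API; which cores share a CCX/CCD is machine-specific and would have to come from the NUMA topology:

```cpp
#include <pthread.h> // pthread_setaffinity_np is a GNU extension (_GNU_SOURCE)
#include <sched.h>

// Pin the calling thread to a single core; on Zen, keeping cooperating threads
// on the same CCX/CCD avoids cross-chiplet traffic.
void bind_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}
```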

  8. Tile blocking needs to support blocks that do not divide the problem size evenly (a run-time block-size sketch is below).
    a). grid-wise/block-wise/thread-wise GEMM need to compute the current block size at run time instead of compile time.
    b). thread-wise GEMM is dispatched to different kernels.
    c). thread-wise transfer needs to handle unevenly divided sizes and packing (or no packing).
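A minimal sketch of computing the current block size at run time, so the last block along a dimension can be smaller than the tile size:

```cpp
#include <algorithm>

// Size of the block starting at block_start: the full tile size, except for the
// tail block when tile_size does not divide dim_size evenly.
int current_block_size(int dim_size, int block_start, int tile_size)
{
    return std::min(tile_size, dim_size - block_start);
}
```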

  9. Transpose while reading/writing through AVX registers (how should this be described by a tensor transform?). A small in-register transpose sketch is below.
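A minimal sketch of a 4x4 in-register transpose with SSE; an 8x8 AVX version follows the same unpack/shuffle pattern but needs more steps:

```cpp
#include <immintrin.h> // _MM_TRANSPOSE4_PS comes from xmmintrin.h, pulled in here

// Transpose a 4x4 float tile entirely in xmm registers.
void transpose_4x4(const float* src, int src_stride, float* dst, int dst_stride)
{
    __m128 r0 = _mm_loadu_ps(src + 0 * src_stride);
    __m128 r1 = _mm_loadu_ps(src + 1 * src_stride);
    __m128 r2 = _mm_loadu_ps(src + 2 * src_stride);
    __m128 r3 = _mm_loadu_ps(src + 3 * src_stride);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(dst + 0 * dst_stride, r0);
    _mm_storeu_ps(dst + 1 * dst_stride, r1);
    _mm_storeu_ps(dst + 2 * dst_stride, r2);
    _mm_storeu_ps(dst + 3 * dst_stride, r3);
}
```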

  10. DimAccessOrder with OpenMP.
    e.g. when the order is <0, 1, 2>, we need to merge dim 1 and dim 2 so the merged dimension can be distributed across threads with OpenMP (a sketch is below).
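A minimal sketch of merging dim 1 and dim 2 into one iteration space for OpenMP; the lengths and the element visit are illustrative:

```cpp
void traverse(int len0, int len1, int len2)
{
    for(int i0 = 0; i0 < len0; ++i0)
    {
        // merge dim 1 and dim 2 into a single loop so OpenMP has enough work to distribute
        #pragma omp parallel for
        for(int i12 = 0; i12 < len1 * len2; ++i12)
        {
            int i1 = i12 / len2; // recover the original dim-1 index
            int i2 = i12 % len2; // recover the original dim-2 index
            // ... visit element (i0, i1, i2) ...
        }
    }
}
```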

  11. Dynamic thread-wise copy.
    Today the GPU code uses static_ford over the copy dimensions. On the CPU, however, the number of iterations can be thousands or tens of thousands, which is too many to expand statically (a run-time alternative is sketched below).
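A minimal sketch of a run-time replacement for static_ford (the name dynamic_ford is hypothetical): it walks an N-dimensional index space with ordinary loops instead of compile-time unrolling:

```cpp
#include <array>

// Visit every multi-index in `lengths`, last dimension fastest, using a
// run-time odometer instead of compile-time expansion.
template <int NDim, typename F>
void dynamic_ford(const std::array<int, NDim>& lengths, F visit)
{
    std::array<int, NDim> idx{};
    while(true)
    {
        visit(idx);

        int d = NDim - 1;
        for(; d >= 0; --d)
        {
            if(++idx[d] < lengths[d])
                break;  // no carry, keep going
            idx[d] = 0; // carry into the next-slower dimension
        }
        if(d < 0)
            break; // wrapped around the slowest dimension: done
    }
}
```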

@carlushuang carlushuang changed the title [avx2] issues with avx2 prototyping [avx2] design/issues with avx2 prototyping Apr 7, 2022