I tried CK Tile GEMM with the V3 pipeline (https://github.com/ROCm/composable_kernel/blob/develop/example/ck_tile/03_gemm/universal_gemm.cpp) for compute-bound cases (e.g., M = 4096, N = 4096, K = 4096), but it performs much worse than the old CK GEMM V3 example (https://github.com/ROCm/composable_kernel/blob/develop/example/01_gemm/gemm_xdl_bf16_v3.cpp) with the same tile size, 256x256x64.
CK Tile V3 (359.483 TFlops)
./bin/tile_example_universal_gemm -m=4096 -n=4096 -k=4096 -v=0
Launching kernel with args: grid: {16, 16, 1}, blocks: {256, 1, 1}
Run Gemm kernel with M =4096 N =4096 K =4096 StrideA =4096 StrideB =4096 StrideC =4096 : 0.382324 ms, 359.483 TFlops, 263.293 GB/s,
vs. Old CK GEMM V3 (615.46 TFlops)
./bin/example_gemm_xdl_bf16_v3 0 2 1 4096 4096 4096 4096 4096 4096 1
a_m_k: dim 2, lengths {4096, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
RotatingMemWrapper: { size_a: 33554432, size_b: 33554432, rotating_count: 4}
Perf: 0.223311 ms, 615.46 TFlops, 450.776 GB/s, DeviceGemmXdlUniversal<Default, RCR> BlkSize: 256, BlkTile: 256x256x64, WaveTile: 32x32, WaveMap: 4x4, VmemReadVec: 8x8, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3, BlkGemmPipelinePrefetchStages: 2
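As a sanity check on the two logs, the reported figures can be recomputed from the kernel times. This is a minimal sketch, not code from either example: the helper name `gemm_metrics` is hypothetical, and it assumes the usual conventions of 2*M*N*K floating-point operations per GEMM, 2-byte elements, and A, B, and C each moved through memory exactly once. Under those assumptions both logs are reproduced to the printed precision, and the old CK kernel comes out roughly 1.7x faster.

```python
def gemm_metrics(m, n, k, time_ms, bytes_per_elem=2):
    """Return (TFlops, GB/s) for one GEMM run.

    Assumptions (not taken from the CK sources): 2*M*N*K flops per GEMM,
    and A (MxK), B (KxN), C (MxN) each transferred once at bytes_per_elem.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


# Kernel times copied from the two logs above.
tile_tflops, tile_gbps = gemm_metrics(4096, 4096, 4096, 0.382324)
old_tflops, old_gbps = gemm_metrics(4096, 4096, 4096, 0.223311)
speedup = old_tflops / tile_tflops

print(f"CK Tile V3: {tile_tflops:.3f} TFlops, {tile_gbps:.3f} GB/s")
print(f"Old CK V3:  {old_tflops:.3f} TFlops, {old_gbps:.3f} GB/s")
print(f"Old CK / CK Tile: {speedup:.2f}x")
```

Because both runs use the same problem size, the TFlops ratio is simply the inverse ratio of the measured times, so the gap is a pure kernel-time difference for identical work.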
@carlushuang @aosewski
Hi @zjing14. An internal ticket has been created to investigate your issue. Thanks!