Commit
[CodePartition] Optimize the placement of consumer releases (#11)
Fix the inefficient placement of consumer releases when producer and consumer are not in the same scope. For example, given

```
Q = tl.load
for (..)
  K = tl.load
  QK = dot(Q, K)
  ...
tl.store
```

Previously the consumer release corresponding to `Q` was placed after the store. With the current fix the release would go right after the for loop.

`TORCH_CUDA_ARCH_LIST=9.0a python run.py --op flash_attention --only triton_tutorial_flash_v2_tma_ws,triton_tutorial_flash_v2_tma_ws_persistent,triton_tutorial_flash_v2 --num-inputs 1 --seq-len 10 --metrics tflops --batch 1024 --n-heads 4 --d-head 128 --cudagraph`

Before:

```
(Batch, Heads, SeqLen, Dhead)    triton_tutorial_flash_v2_tma_ws-tflops    triton_tutorial_flash_v2_tma_ws_persistent-tflops    triton_tutorial_flash_v2-tflops
-------------------------------  ----------------------------------------  ---------------------------------------------------  ---------------------------------
(1024, 4, 1024, 128)             393.141                                   400.046                                              366.498
```

After:

```
(Batch, Heads, SeqLen, Dhead)    triton_tutorial_flash_v2_tma_ws-tflops    triton_tutorial_flash_v2_tma_ws_persistent-tflops    triton_tutorial_flash_v2-tflops
-------------------------------  ----------------------------------------  ---------------------------------------------------  ---------------------------------
(1024, 4, 1024, 128)             396.43                                    422.847                                              363.753
```
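The placement rule described in this message can be illustrated with a toy model. The sketch below is not the CodePartition pass itself. The `Op` class and the `uses_value` and `release_index` helpers are hypothetical names invented for this illustration, and it assumes the release belongs after the last operation in the producer's scope that reads the value, either directly or inside a nested region.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical stand-in for an IR operation (not a Triton class)."""
    name: str
    uses: tuple = ()                           # values read directly by this op
    body: list = field(default_factory=list)   # nested ops, e.g. a loop body


def uses_value(op, value):
    """Return True if `op`, or anything nested inside it, reads `value`."""
    return value in op.uses or any(uses_value(child, value) for child in op.body)


def release_index(block, value):
    """Index of the op in `block` after which the consumer release would go.

    `block` is the scope that contains the producer. The result is the last
    op in that scope that reads `value`, directly or in a nested region,
    or None if nothing reads it.
    """
    last = None
    for i, op in enumerate(block):
        if uses_value(op, value):
            last = i
    return last


# Toy version of the example in this commit message.
loop = Op("for", body=[
    Op("K = tl.load"),
    Op("QK = dot(Q, K)", uses=("Q", "K")),
])
program = [Op("Q = tl.load"), loop, Op("tl.store")]

idx = release_index(program, "Q")
print(f"release for Q goes after: {program[idx].name}")
```

In this model the release for `Q` lands after the `for` op rather than after `tl.store`, because the loop is the last operation in `Q`'s scope that still reads it.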