Releases: xlite-dev/CUDA-Learn-Notes
📚FA2: QK Fine-grained Tiling
What's Changed
- [FA2] hotfix flash-attn-mma smem size setting✔️ by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/170
- [FA2] reorder grid layout, boost 5~10% TFLOPS✔️ by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/171
- [FA2] optimize block tiling for headdim >= 128✔️ by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/172
- [FA2] flash-attn-mma tiling-qk for large d⚡️ by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/173
- [FA2] fix tiling-qk misaligned address✔️ by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/174
- [README] Refactor README.md✔️ by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/175
- [README] Refactor README✔️ by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/176
📚 Split Q + QK Fine-grained Tiling (O(16xd) SRAM vs FA2 O(4xBrxd) SRAM, Headdim -> 1024)
// Fine-grained tiling at the MMA level for Q and K results in a constant SRAM usage of
// 64 * kMmaAtomK for Q and K. For V, the SRAM complexity is O(kMmaAtomK * d), leading to
// an overall SRAM complexity of O(kMmaAtomK * d). Consequently, this approach allows us to
// extend D (head dimension) up to 1024. Performance tuning is ongoing; stay tuned for updates ~
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);
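For a rough sense of why this lets D scale to 1024, here is an illustrative SMEM accounting, a sketch with assumed tile shapes and helper names (not the kernel's actual buffers), comparing the FA2-style O(4xBrxd) footprint with the fine-grained O(kMmaAtomK x d) one; with MMA m16n8k16, kMmaAtomK = 16.
#include <cuda_fp16.h>
constexpr int kMmaAtomK = 16;  // K-dim of the m16n8k16 MMA atom
// FA2-style tiling: roughly four [Br, d]-sized half tiles (Q, K, V, O) resident in SMEM.
constexpr size_t fa2_smem_bytes(size_t Br, size_t d) {
  return 4 * Br * d * sizeof(half);
}
// QK fine-grained tiling: Q and K each keep only a constant 64 x kMmaAtomK slice,
// while V still needs an O(kMmaAtomK * d) tile, so the total grows as O(16 x d).
constexpr size_t tiling_qk_smem_bytes(size_t d) {
  return (2 * 64 * kMmaAtomK + kMmaAtomK * d) * sizeof(half);
}
Under these illustrative numbers, d = 1024 with Br = 64 gives roughly 512 KiB per block for the FA2-style layout versus roughly 36 KiB for the fine-grained one, which is what makes head dimensions up to 1024 fit within the per-SM SMEM budget.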
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6.9...v2.6.10
FA2 Fully Shared QKV SMEM🎉
What's Changed
- [FA2] Update flash-attn-mma shared-kv/qkv🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/163
- [FA2] Update flash-attn-mma shared-kv/qkv🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/164
- [FA2] Update flash-attn-mma shared-qkv🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/165
- [FA2] Update flash-attn-mma shared-kv🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/166
- [FA2] Update flash-attn-mma split-kv/q🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/167
- [FA2] Update flash-attn-mma shared-qkv🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/168
- [FA2] flash-attn-mma get rid of transpose-k✔️ by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/169
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6.8...v2.6.9
FA2 Fully Shared QKV SMEM🎉
What's Changed
- [FA2] Release flash-attn-mma split-kv/q🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/161
- [FA2] Release flash-attn-mma shared-kv/qkv🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/162
I have also implemented FlashAttention-2 using pure MMA PTX instructions, which supports features such as Multi-Stages, Tile MMA, Tile Warp, Fully Shared QKV SMEM, Prefetch Q s2r, Collective Store, etc. Currently, for small-scale attention (B<=4, H<=48, SeqLen<=8192), it can run faster than the official FA2 on some devices, for example, the NVIDIA RTX 3080 Laptop.
- Example: B=1, H=8, N=8192, D=64 (NVIDIA RTX 3080 Laptop)
python3 flash_attn_mma.py --B 1 --H 8 --D 64 --N 8192 --iters 10 # NVIDIA RTX 3080 Laptop
------------------------------------------------------------------------------------------------------------------------
B: batch_size, H: n_head, N: seq_len, D: head_dim, seed: 1617, Warmup: 1, Iters: 10
------------------------------------------------------------------------------------------------------------------------
B=1, H=8, N=8192, D=64, Warmup: 1, Iters: 10
mma(split-kv+stage1): ['0.01960754 ', '0.01452637 ', '-0.02592468 '], time:5.586338ms, TFLOPS:25.08
mma(split-kv+stage2): ['0.01960754 ', '0.01452637 ', '-0.02592468 '], time:5.326223ms, TFLOPS:26.31
mma(split-q+stage1): ['0.01960754 ', '0.01452637 ', '-0.02592468 '], time:3.834152ms, TFLOPS:36.54
mma(split-q+stage2): ['0.01960754 ', '0.01452637 ', '-0.02592468 '], time:4.328346ms, TFLOPS:32.37
mma(split-q+share-kv+stage1): ['0.01960754 ', '0.01452637 ', '-0.02592468 '], time:2.636528ms, TFLOPS:53.15
mma(split-q+share-qkv+stage1): ['0.01960754 ', '0.01452637 ', '-0.02592468 '], time:2.594471ms, TFLOPS:54.01
mma(split-q+share-qkv+stage2): ['0.01960754 ', '0.01452637 ', '-0.02592468 '], time:2.574611ms, TFLOPS:54.42
(flash): ['0.01963806 ', '0.0145874 ', '-0.02593994 '], time:3.764462ms, TFLOPS:37.22
-----------------------------------------------------------------------------------------------------------------------
However, for large-scale attention computations, there remains a performance gap. Performance is continuously being optimized. Stay tuned for updates ~ Please refer to flash-attention-mma⚡️⚡️ for more details.
Tensor Cores | Loop over Seqlen/Headdim | Tile Block (Br, Bc) | MMA (m16n8k16) |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Pack LDST (128 bits) | SMEM Padding | Copy Async | Tile MMA (More Threads) |
✔️ | ✔️ | ✔️ | ✔️ |
Tile Warp (More Values) | Multi Stages (1/2) | Collective Store (Shfl) | Split KV/Q |
✔️ | ✔️ | ✔️ | ✔️ |
Shared KV SMEM | Fully Shared QKV SMEM | Prefetch Q s2r | SMEM/Block Swizzle |
✔️ | ✔️ | ✔️ | ? |
The Split KV and Split Q implementations have been carried out in flash-attention-mma⚡️⚡️ for performance comparison. The Split KV method, which splits all of Q, K, and V across MMAs (Warps), is slower than the Split Q policy, which splits only Q across MMAs (Warps) while keeping K and V accessible to all of them.
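One reason Split Q reduces inter-warp communication is that each warp owns complete rows of Q, so the online-softmax statistics (row max, row sum) can stay within a warp instead of being combined across warps through SMEM. The snippet below is an illustrative building block only, not code taken from the kernels: an intra-warp max reduction via shuffles, the kind of primitive such combining relies on.
// Illustrative only: intra-warp max reduction via shuffle. In the Split KV
// layout, partial row statistics additionally have to round-trip through SMEM
// to be merged across warps; the Split Q layout keeps each row warp-local.
__device__ __forceinline__ float warp_reduce_max(float val) {
  #pragma unroll
  for (int offset = 16; offset > 0; offset >>= 1) {
    val = fmaxf(val, __shfl_xor_sync(0xffffffffu, val, offset));
  }
  return val;
}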
- 📚 Split KV (Basic, FlashAttention-1)
// Split QKV across MMA(Warps) using naive matmul MMA&Warp tiling policy.
// case: The layout of 8 MMA(2x4) [after] kWarpTileSeqLenQxkWarpTileSeqLenK(2x2) -> 32x2,32x2=64x64:
// | [64,64] | warp_KV 0 | warp_KV 1 | warp_KV 2 | warp_KV 3 |
// | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|
// | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|
// | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 3 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|
// | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 3 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|
__global__ void
flash_attn_mma_stages_split_kv_kernel(half* Q, // [B, H, N, D]
half* K, // [B, H, D, N] K^T transposed
half* V, // [B, H, N, D]
half* O, // [B, H, N, D]
int QKV_seqlen);
- 📚 Split Q (Faster, FlashAttention-2)
// Split Q across MMAs (Warps) and keep K/V accessible to all MMAs (Warps),
// in order to reduce the communication between warps via SMEM and warp shuffle.
// case: MMA = m16n8k16, Br=16x4=64, Bc=8x8=64, layout: 4 warps
// | 64x64 | warp_KV 0 |
// | warp_QP 0 | MMA 0 ... MMA 0 (x8) |
// | warp_QP 1 | MMA 1 ... MMA 1 (x8) |
// | warp_QP 2 | MMA 2 ... MMA 2 (x8) |
// | warp_QP 3 | MMA 3 ... MMA 3 (x8) |
__global__ void
flash_attn_mma_stages_split_q_kernel(half* Q, // [B, H, N, D]
half* K, // [B, H, D, N] K^T transposed
half* V, // [B, H, N, D]
half* O, // [B, H, N, D]
int QKV_seqlen);
- 📚 Split Q + Shared KV SMEM (Faster+)
// K and V share the same shared memory buffer, improving block occupancy
// (see the SMEM accounting sketch after this list).
__global__ void
flash_attn_mma_stages_split_q_shared_kv_kernel(half* Q,
half* K,
half* V,
half* O,
int QKV_seqlen);
- 📚 Split Q + Fully Shared QKV SMEM (Faster++)
// Q, K, and V fully share the same shared memory buffer, with Q prefetched
// from SMEM to registers (s2r), improving block occupancy.
__global__ void
flash_attn_mma_stages_split_q_shared_qkv_kernel(half* Q,
half* K,
half* V,
half* O,
int QKV_seqlen);
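To make the occupancy argument concrete, here is an illustrative SMEM accounting for the three split-Q variants. The tile shapes, single-stage assumption, and helper names are a sketch, not the kernels' actual buffers: sharing buffers shrinks the per-block footprint, so more blocks can be resident per SM.
#include <cuda_fp16.h>
// Per-block SMEM under assumed [Br, d] / [Bc, d] half tiles (single stage).
constexpr size_t split_q_bytes(size_t Br, size_t Bc, size_t d) {
  return (Br * d + 2 * Bc * d) * sizeof(half);   // separate Q, K, V tiles
}
constexpr size_t shared_kv_bytes(size_t Br, size_t Bc, size_t d) {
  return (Br * d + Bc * d) * sizeof(half);       // K and V reuse one tile
}
constexpr size_t shared_qkv_bytes(size_t Br, size_t Bc, size_t d) {
  return (Br > Bc ? Br : Bc) * d * sizeof(half); // Q/K/V reuse one tile; Q prefetched to registers first
}
// Rough SMEM-only bound on resident blocks per SM (e.g. ~100 KiB usable on Ampere).
constexpr size_t blocks_per_sm(size_t smem_per_block, size_t smem_per_sm = 100 * 1024) {
  return smem_per_sm / smem_per_block;
}
With Br = Bc = 64 and d = 64, that is roughly 24 KiB, 16 KiB, and 8 KiB per block respectively, so under these assumptions the SMEM-imposed ceiling on resident blocks per SM rises from about 4 to 6 to 12.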
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6.7...v2.6.8
🎉FA2 MMA Split KV/Q
What's Changed
- [FlashAttention] Update flash-attention-mma 0.0.1 🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/159
- [FA2] Release flash-attn-mma split-kv/q🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/160
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6.6...v2.6.7
🎉flash-attention-mma 0.0.1
What's Changed
- [HGEMM] CuTe HGEMM debug Makefile target by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/154
- [Softmax] Update Online Softmax bindings by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/155
- [FlashAttention] Refactor toy-flash-attn codes part-1 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/156
- [Bug]Fix typo by @wjj19950828 in https://github.com/DefTruth/CUDA-Learn-Notes/pull/157
- [FlashAttention] Release flash-attention-mma 0.0.1 🎉 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/158
New Contributors
- @wjj19950828 made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/157
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6.5...v2.6.6
⚡️⚡️toy-hgemm library
What's Changed
- [HGEMM] Update RTX 3080 Laptop perf by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/148
- [HGEMM] Update toy-hgemm library 0.1.0 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/149
- [HGEMM] Update toy-hgemm library 0.1.0 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/150
- [HGEMM] Update toy-hgemm library 0.1.0 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/152
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6.4...v2.6.5
toy-hgemm library
What's Changed
- [HGEMM] Release toy-hgemm library 0.1.0 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/146
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6.3...v2.6.4
toy-hgemm library
What's Changed
- [HGEMM] Release toy-hgemm library 0.1.0 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/145
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6.2...v2.6.3
CuTe HGEMM Block Swizzle
What's Changed
- [HGEMM] trans mat b from row major -> col major by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/135
- [HGEMM] refactor HGEMM cpp benchmark by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/136
- [HGEMM] Update HGEMM L20/4090 Bench by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/137
- [HGEMM] fix cublas hgemm handle error by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/138
- [HGEMM] Add MMA HGEMM NN C++ benchmark by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/139
- [HGEMM] CuTe HGEMM with Thread Block Swizzle by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/140
- [HGEMM] clear tensor cache avoid OOM by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/141
- [HGEMM] Add gc.collect to HGEMM bench script by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/142
- [HGEMM] Add show_memory option to bench by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/143
- [HGEMM] manually init/destroy cublas handle by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/144
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6.1...v2.6.2
v2.6.1 CuTe HGEMM
What's Changed
- [HGEMM] Add large MNK block swizzle policy by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/132
- [HGEMM] Add CuTe HGEMM with SMEM Swizzle by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/134
- Update embedding.cu by @TheManWhoIsStupid in https://github.com/DefTruth/CUDA-Learn-Notes/pull/133
New Contributors
- @TheManWhoIsStupid made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/133
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.6...v2.6.1