13 | 13 | </div>
14 | 14 |
15 | 15 |
16 |
| -📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥(Easy -> Hard++)](#cuda-kernel) with PyTorch bindings, [📖100+ LLM/VLM/CV/CUDA/CuTe🔥](#my-blogs-part-1) blogs, [📖toy-hgemm⚡️⚡️](./kernels/hgemm) which can achieve `98%~100%` performance of **cuBLAS**, and [📖flash-attention-mma⚡️⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
| 16 | +📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥🔥(Easy -> Hard++)](#cuda-kernel) with PyTorch bindings, [📖100+ LLM/VLM/CV/CUDA/CuTe🔥](#my-blogs-part-1) blogs, [📖toy-hgemm⚡️⚡️](./kernels/hgemm), which can achieve `98%~100%` of **cuBLAS** performance, and [📖flash-attention-mma⚡️⚡️](./kernels/flash-attn), which uses Tensor Cores with pure MMA PTX. You are welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
17 | 17 |
18 | 18 | <div id="contents"></div>
19 | 19 |
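For readers new to the "CUDA kernels with PyTorch bindings" pattern mentioned above, here is a minimal, self-contained sketch of what such a binding looks like. It is not a kernel from this repo; the `elementwise_add_f16` names are made up for illustration, and the real kernels under `./kernels/*` follow the same structure in more elaborate form.

```cuda
// Minimal sketch of a CUDA kernel exposed to PyTorch via a C++ extension.
// Hypothetical example for illustration only, not taken from ./kernels/*.
#include <torch/extension.h>
#include <cuda_fp16.h>

__global__ void elementwise_add_f16_kernel(const half* a, const half* b,
                                           half* c, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) c[idx] = __hadd(a[idx], b[idx]);  // f16 add on the GPU
}

// Host-side launcher: takes torch::Tensor inputs and launches over n elements.
void elementwise_add_f16(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
  const int n = static_cast<int>(a.numel());
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  elementwise_add_f16_kernel<<<blocks, threads>>>(
      reinterpret_cast<const half*>(a.data_ptr<at::Half>()),
      reinterpret_cast<const half*>(b.data_ptr<at::Half>()),
      reinterpret_cast<half*>(c.data_ptr<at::Half>()), n);
}

// Register the op so Python can call it after building with torch cpp_extension.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("elementwise_add_f16", &elementwise_add_f16, "f16 elementwise add (CUDA)");
}
```

Built with `torch.utils.cpp_extension.load(...)`, the op is then callable from Python as `module.elementwise_add_f16(a, b, c)` on half-precision CUDA tensors.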
@@ -169,7 +169,7 @@ flash_attn_mma_stages_split_q_tiling_qkv_kernel(half* Q, half* K, half* V, half*
169 | 169 | }
170 | 170 | ```
171 | 171 |
172 |
| -## 📖 150+ CUDA Kernels 🔥🔥 (Easy -> Hard++) ([©️back👆🏻](#contents))
| 172 | +## 📖 200+ CUDA Kernels 🔥🔥 (Easy -> Hard++) ([©️back👆🏻](#contents))
173 | 173 |
174 | 174 | <div id="cuda-kernel"></div>
175 | 175 |
@@ -357,6 +357,8 @@ The kernels listed here will guide you through a step-by-step progression, rangi
357 | 357 |
358 | 358 | ### 📚 Hard+ ⭐️⭐️⭐️⭐️ & Hard++ ⭐️⭐️⭐️⭐️⭐️ ([©️back👆🏻](#cuda-kernel))
359 | 359 |
| 360 | +- 📚 FlashAttention-2 MMA (MMA Acc F32/F16, swizzle, QKV smem share, fine-grained tiling, etc.🎉)
| 361 | +
360 | 362 | <div id="cuda-kernel-hard-plus"></div>
361 | 363 |
362 | 364 | |📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |
@@ -390,9 +392,25 @@ The kernels listed here will guide you through a step-by-step progression, rangi
390 | 392 | | ✔️ [flash_attn...tiling_qkv_swizzle{qk}{f32}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qkv_swizzle_qk_F32F16F16F32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
391 | 393 | | ✔️ [flash_attn...tiling_qkv_swizzle{qkv}{f32}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qkv_swizzle_qkv_F32F16F16F32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
392 | 394 |
393 |
| -**rr**: means reduce registers usage (for `d>128`); **f32**: means MMA accumulate with FP32 dtype, otherwise, FP16. softmax Acc dtype is always be FP32 for high precision; **swizzle**: now, only support smem swizzle for MMA.
| 395 | +💡NOTE: **rr** means reduced register usage (for `d>128`); **f32** means the MMA accumulates in FP32 dtype, otherwise FP16 (the softmax Acc dtype is always FP32 for high precision); **swizzle** means smem swizzle, which is currently only supported for the MMA kernels (see the sketch below).
| 396 | +
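To make the two accumulator flavors and the swizzle concrete, here is an illustrative sketch (sm_80+, not code lifted from these kernels; `hmma16816_f32` and `swizzle_f16x8` are hypothetical helper names). The **f32** variants issue the FP32-accumulator form of the m16n8k16 HMMA, the **f16** variants use the `mma...f16.f16.f16.f16` form with packed `half2` C/D fragments instead, and smem swizzle typically means an XOR-permuted shared-memory layout like the one below (the exact swizzle used in this repo's kernels may differ).

```cuda
#include <cstdint>

// FP32-accumulator m16n8k16 HMMA (what the "f32" kernel variants use).
// The "f16" variants swap this for mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16.
__device__ __forceinline__ void hmma16816_f32(
    float &d0, float &d1, float &d2, float &d3,
    uint32_t a0, uint32_t a1, uint32_t a2, uint32_t a3,
    uint32_t b0, uint32_t b1,
    float c0, float c1, float c2, float c3) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
      : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
      : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1),
        "f"(c0), "f"(c1), "f"(c2), "f"(c3));
}

// A common XOR-based smem swizzle for 8 x f16 (16-byte) chunks: permute the
// chunk index by the row so column-wise loads hit different shared-memory banks.
__device__ __forceinline__ int swizzle_f16x8(int row, int col) {
  return (((col >> 3) ^ (row & 7)) << 3) | (col & 7);
}
```

FP32 accumulation costs more registers but keeps the QK^T and PV partial sums numerically stable over long sequences, which is why the f32 variants exist alongside the cheaper f16 ones.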
| 397 | +- 📚 FFPA Attention MMA (**1.8x~3x**🎉faster vs SDPA EA for D > 256, a head dim that FA2 does not support)
394 | 398 |
395 |
| -## 📖 Blog Index
| 399 | +|📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |
| 400 | +|:---|:---|:---|:---|:---|
| 401 | +| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L1.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
| 402 | +| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
| 403 | +| ✔️ [ffpa_mma_stages_split_q_L1_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
| 404 | +| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L2.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
| 405 | +| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
| 406 | +| ⚠️ [ffpa_mma_stages_split_q_L2_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
| 407 | +| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L3.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
| 408 | +| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
| 409 | +| ⚠️ [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
| 410 | +
| 411 | +💡NOTE: 🤖[ffpa-attn-mma](https://github.com/DefTruth/ffpa-attn-mma): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.8x~3x**🎉faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-l20), [📈 A30 ~1.8x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-4090). |
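As a rough back-of-envelope check on why `headdim > 256` is the pain point (my own estimate, not a number from the FFPA repo, with assumed tile sizes $B_r$, $B_c$ and head dim $d$): a standard FA2-style tiling keeps Q, K and V tiles of the full head dimension in shared memory, so per thread block

$$
\mathrm{smem} \approx (B_r + 2B_c)\cdot d \cdot 2\ \text{bytes (f16)},\qquad
B_r = B_c = 64,\ d = 512 \;\Rightarrow\; \approx 192\ \text{KB},
$$

which already exceeds the roughly 100~164 KB of shared memory available per block on Ampere/Ada GPUs. Keeping the smem footprint (near) constant in $d$ is what the O(1) SRAM claim above refers to.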
| 412 | + |
| 413 | +## 📖 100+ LLM/VLM/CV/CUDA/CuTe Tech Blogs
396 | 414 |
397 | 415 | <div id="my-blogs-part-1"></div>
398 | 416 |