Commit 90d420a

update docs

1 parent cc7efa2

1 file changed: README.md (+17 −4)
@@ -24,7 +24,7 @@ Inference of Stable Diffusion and Flux in pure C/C++
 - Full CUDA, Metal, Vulkan and SYCL backend for GPU acceleration.
 - Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models
 - No need to convert to `.ggml` or `.gguf` anymore!
-- Flash Attention for memory usage optimization (only cpu for now)
+- Flash Attention for memory usage optimization
 - Original `txt2img` and `img2img` mode
 - Negative prompt
 - [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
@@ -182,11 +182,21 @@ Example of text2img by using SYCL backend:
 
 ##### Using Flash Attention
 
-Enabling flash attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.
+Enabling flash attention for the diffusion model reduces memory usage; the savings vary by model and resolution,
+e.g.:
+- flux 768x768: ~600 MB
+- SD2 768x768: ~1400 MB
 
+For most backends it slows generation down, but for cuda it generally speeds it up.
+At the moment it is only supported for some models and some backends (like cpu, cuda/rocm, metal).
+
+Run by adding `--diffusion-fa` to the arguments and watch for:
 ```
-cmake .. -DSD_FLASH_ATTN=ON
-cmake --build . --config Release
+[INFO ] stable-diffusion.cpp:312 - Using flash attention in the diffusion model
+```
+and for the compute buffer shrinking in the debug log:
+```
+[DEBUG] ggml_extend.hpp:1004 - flux compute buffer size: 650.00 MB(VRAM)
 ```
 
 ### Run
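For reference, a minimal sketch of running with the new flag, assuming the repo's usual example model and prompt (the path and prompt here are placeholders, not part of this commit):

```
# Hypothetical invocation: model path and prompt are placeholders.
# --diffusion-fa enables flash attention in the diffusion model.
./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat" --diffusion-fa
```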
@@ -239,6 +249,9 @@ arguments:
   --vae-tiling                       process vae in tiles to reduce memory usage
   --vae-on-cpu                       keep vae in cpu (for low vram)
   --clip-on-cpu                      keep clip in cpu (for low vram).
+  --diffusion-fa                     use flash attention in the diffusion model (for low vram).
+                                     Might lower quality, since it implies converting k and v to f16.
+                                     This might crash if it is not supported by the backend.
   --control-net-cpu                  keep controlnet in cpu (for low vram)
   --canny                            apply canny preprocessor (edge detection)
   --color                            Colors the logging tags according to level
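To verify the flag took effect, one option is to filter the log output for the messages quoted above; this sketch assumes the CLI's `-v`/`--verbose` flag (not shown in this hunk) to surface the DEBUG line:

```
# Hypothetical check: grep the log for the INFO/DEBUG lines documented above.
./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat" --diffusion-fa -v 2>&1 \
  | grep -E "flash attention|compute buffer"
```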
