Problem & Motivation
Flash attention v3 is significantly faster than flash attention v2 at long contexts. Instructions for installing and configuring it are on the Transformer Engine (TE) website:
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py#L199-L203
With that configured, regular flash attention v2 is still used on pre-Hopper GPUs:
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py#L460-L463
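A minimal sketch of that distinction (assumes a CUDA build of PyTorch and a visible GPU): flash attention v3 kernels target Hopper (compute capability 9.0), so anything older keeps using flash attention v2 via the fallback linked above.

```python
import torch

# Compute capability (major, minor) of the current GPU.
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (9, 0):
    print("Hopper or newer: flash attention v3 kernels can be used")
else:
    print("pre-Hopper GPU: Transformer Engine falls back to flash attention v2")
```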
We can add the installation steps to our Docker container setup.
BioNeMo Framework Version
v2.4.1
Category
Model/Training
Proposed Solution
Follow the TE installation steps. Check that models still pass partial convergence tests, and verify that the long-context speedup shows up when using the "flash" backend on H100 or newer (a rough timing sketch follows).
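Rough timing sketch, not a full benchmark: it assumes an H100 (or newer), bf16 inputs, and a container with Transformer Engine installed. The idea would be to run it once in an image built with flash-attn v2 and once in an image with flash-attn v3 and compare the long-context forward time; shapes and iteration counts are illustrative only.

```python
import os
os.environ["NVTE_FUSED_ATTN"] = "0"   # keep the cuDNN fused backend out of the comparison

import time
import torch
import transformer_engine.pytorch as te


def mean_forward_time(seq_len=32768, heads=16, head_dim=128, iters=10):
    attn = te.DotProductAttention(num_attention_heads=heads, kv_channels=head_dim)
    # Default qkv_format is "sbhd": [sequence, batch, heads, head_dim].
    q, k, v = (
        torch.randn(seq_len, 1, heads, head_dim, dtype=torch.bfloat16, device="cuda")
        for _ in range(3)
    )
    for _ in range(3):  # warmup
        attn(q, k, v)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        attn(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


print(f"mean attention forward time: {mean_forward_time():.4f} s")
```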
Expected Benefits
Flash attention v3 should be about as fast as the cuDNN fused attention implementation, which is up to 50% faster than flash attention v2 at 1M context length with context parallelism on transformer models. This would also benefit other architectures, like evo2, that use attention in some layers.
Code Example
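A hypothetical smoke test, assuming a container where Transformer Engine and flash-attn v3 are both installed: force the flash backend, have TE print which backend it actually selects, and run one long-context forward pass in bf16. The environment variables are TE's backend-selection knobs; the shape is illustrative.

```python
import os

os.environ["NVTE_FLASH_ATTN"] = "1"   # allow the FlashAttention backend
os.environ["NVTE_FUSED_ATTN"] = "0"   # disable cuDNN fused attention for this check
os.environ["NVTE_DEBUG"] = "1"        # print backend-selection info
os.environ["NVTE_DEBUG_LEVEL"] = "2"

import torch
import transformer_engine.pytorch as te

seq_len, batch, heads, head_dim = 16384, 1, 16, 128
attn = te.DotProductAttention(num_attention_heads=heads, kv_channels=head_dim)

# Default qkv_format is "sbhd": [sequence, batch, heads, head_dim].
q, k, v = (
    torch.randn(seq_len, batch, heads, head_dim, dtype=torch.bfloat16, device="cuda")
    for _ in range(3)
)
out = attn(q, k, v)
print("output:", tuple(out.shape), out.dtype)
```

On an H100 with flash-attn v3 installed, the debug output should show the FlashAttention backend being selected; on a pre-Hopper GPU it should show flash attention v2 being used instead.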