Performance gap in sgemm_wmma gpu #256

xiefan46 · 2025-03-03T10:10:42Z

I found a large performance gap between the custom sgemm_wmma implementation and the cuBLAS implementation on an A100 GPU. I tried increasing the number of stages to 10, but it didn't seem to help.

DefTruth · 2025-03-03T10:47:00Z

sgemm_wmma has not been fully optimized for performance yet. You are welcome to submit a PR with optimizations.

xiefan46 · 2025-03-03T11:18:22Z

@DefTruth Sure, let me take a look. Any idea where the gap comes from?
