
Questions about some of the syntax in the MMA kernels #258

Open
lzcchl opened this issue Mar 11, 2025 · 1 comment

Comments

@lzcchl

lzcchl commented Mar 11, 2025

https://github.com/DefTruth/CUDA-Learn-Notes/blob/main/kernels/hgemm/mma/basic/hgemm_mma.cu#L94

I've thought about the following lines for a long time without fully understanding them, and would like to ask you about them:
uint32_t load_smem_a_ptr = __cvta_generic_to_shared(&s_a[lane_id % 16][(lane_id / 16) * 8]);
LDMATRIX_X4(RA[0], RA[1], RA[2], RA[3], load_smem_a_ptr);

Question 1:
The indices s_a[lane_id % 16][(lane_id / 16) * 8] correspond to positions [0..15][0 or 8].
In practice, lane 0 reads s_a[0][0] and lane 1 reads s_a[1][0]. Why doesn't lane 1 read s_a[0][8] instead? Wouldn't that make the memory accesses contiguous?

Question 2:
The indexing [0..15][0 or 8] means each thread reads 8 halves, yet these lines contain nothing like LDST128BITS. Is the 128-bit load arranged automatically somewhere, or by some other mechanism?

Question 3:
What exactly is load_smem_a_ptr? I haven't fully understood it. Every thread has a different lane_id, yet they all assign to load_smem_a_ptr, and no explicit offset is ever added to it. Could you explain where the offset went?

@DefTruth
Member

This follows the fragment layout specified for the MMA instructions in the PTX documentation; please see: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m16n8k16-with-floating-point-type

[Figure: matrix-fragment layout for mma.m16n8k16 with floating-point types, from the PTX documentation]
