https://github.com/DefTruth/CUDA-Learn-Notes/blob/main/kernels/hgemm/mma/basic/hgemm_mma.cu#L94
I've been puzzling over these lines for a long time without fully understanding them, and would like to ask:

uint32_t load_smem_a_ptr = __cvta_generic_to_shared(&s_a[lane_id % 16][(lane_id / 16) * 8]);
LDMATRIX_X4(RA[0], RA[1], RA[2], RA[3], load_smem_a_ptr);

Question 1: s_a[lane_id % 16][(lane_id / 16) * 8] corresponds to positions [0~15][0,8]. In practice, lane_id 0 reads s_a[0][0] and lane_id 1 reads s_a[1][0]. Why doesn't lane_id 1 read s_a[0][8]? Wouldn't that make the memory accesses contiguous?

Question 2: The s_a indices [0~15][0,8] imply that each thread reads 8 halves, yet these lines contain nothing like LDST128BITS. Is the 128-bit access assigned automatically somewhere, or is some other mechanism at work?

Question 3: What exactly is load_smem_a_ptr? Each thread's lane_id is different, yet they all assign the result to load_smem_a_ptr, and no offset is ever added to it afterwards. Can you explain where the offset went?
This follows the fragment layout that the PTX documentation specifies for the MMA instructions; see: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m16n8k16-with-floating-point-type