Tensor Core MMA Instruction Tutorial
Using mma.m8n8k4 as an Example
A warp executing mma.m8n8k4 with .f16 floating point type will compute 4 MMA operations of shape .m8n8k4.
A warp executes four m8n8k4 MMA operations at the same time. Each MMA is independent; their results are not reduced together.
- MMA1: Threads with %laneid 0-3 (low group) and 16-19 (high group)
- MMA2: Threads with %laneid 4-7 (low group) and 20-23 (high group)
- MMA3: Threads with %laneid 8-11 (low group) and 24-27 (high group)
- MMA4: Threads with %laneid 12-15 (low group) and 28-31 (high group)
Each MMA group computes its own m8n8k4 matrix product; afterwards the thread-value layout is used to map each value back to its position in the matrix.
MMA usage workflow
- Find the TV (thread-value) layout of matrices A and B, then load each element into the corresponding thread's registers
- e.g. Figure 45 "MMA .m8n8k4 fragment layout for row-major matrix A with .f16 type" in the PTX ISA documentation
- In MMA1, T0-T3 hold the first 4 rows and T16-T19 hold the last 4 rows
- The mapping formula is usually given below the figure
(row, col) denotes the element's position in the matrix; that element must be copied into the thread's registers.
i is the index of the value within the thread.

```
row = %laneid % 4         if %laneid < 16
      (%laneid % 4) + 4   otherwise
col = i                   for ai where i = {0,..,3}
```
- Check the qualifiers, e.g. .col.row (the layouts of A and B)
- PTX invocation
- The documentation specifies how registers must be passed in; for example, two f16 values are packed into one 32-bit register
- For m8n8k4 here, each thread passes 4 f16 values for matrix A, 4 f16 values for matrix B, and 8 f32 values for matrix C (the example below uses the .f32 accumulator variant)
```cuda
// Assumed declarations: float d[8], c[8]; unsigned a[2], b[2];
// (each "r" operand packs two .f16 values)
asm volatile("mma.sync.aligned.m8n8k4.row.col.f32.f16.f16.f32 "
             "{%0, %1, %2, %3, %4, %5, %6, %7}, "
             "{%8, %9}, "
             "{%10, %11}, "
             "{%12, %13, %14, %15, %16, %17, %18, %19};\n"
             : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3]),
               "=f"(d[4]), "=f"(d[5]), "=f"(d[6]), "=f"(d[7])
             : "r"(a[0]), "r"(a[1]),
               "r"(b[0]), "r"(b[1]),
               "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]),
               "f"(c[4]), "f"(c[5]), "f"(c[6]), "f"(c[7]));
```
- Mapping the results back
- e.g. Figure 50 "MMA .m8n8k4 computation 1 and 2 fragment layout for matrix C/D with .f32 type"
- After the computation, the thread values are laid out as shown in the figure and must be stored back into matrix C
- The position-mapping formula is given below the figure
```
row = X      if %laneid < 16
      X + 4  otherwise
where X = (%laneid & 0b1) + (i & 0b10)            for ci where i = {0,..,7}
col = (i & 0b100) + (%laneid & 0b10) + (i & 0b1)  for ci where i = {0,..,7}
```
PTX inline-assembly conventions: 1. operands are bound to positional placeholders (%0, %1, ...); 2. the constraint letter gives the register type ("r" is a regular 32-bit .u32 register; the rest are the usual abbreviations):
- "h" = .u16 reg (half)
- "r" = .u32 reg
- "f" = .f32 reg
Complete code