Profiling GPU Programs with Nsight


Published: Feb 28, 2024

Cheat sheet

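GPU utilization
- compute bound => high SM utilization, low memory utilization
- memory bound => low SM utilization, high memory utilization
- latency bound => neither SM nor memory utilization is high
  - too few warps launched
Breakdown (switch views in the top-right corner)
- The reported utilization is the maximum of the metrics in the breakdown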
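
The rules of thumb above can be written down as a small classifier. This is only a sketch: the 60% cut-off, the function names, and the metric names in the sample dictionaries are assumptions made for illustration, not values or identifiers taken from Nsight.

```python
def unit_utilization(breakdown):
    """Utilization of a unit = maximum of its breakdown metrics (percent)."""
    return max(breakdown.values())


def classify_bound(sm_pct, mem_pct, high=60.0):
    """Classify a kernel from its SM and memory utilization (percent).

    `high` is an assumed cut-off for "high utilization"; adjust as needed.
    """
    sm_high = sm_pct >= high
    mem_high = mem_pct >= high
    if sm_high and not mem_high:
        return "compute bound"
    if mem_high and not sm_high:
        return "memory bound"
    if not sm_high and not mem_high:
        return "latency bound"
    return "balanced"  # both high: a case the list above does not cover


# Hypothetical breakdown values, made up for illustration only.
sm = unit_utilization({"issue_active": 22.0, "alu_pipe": 18.5})
mem = unit_utilization({"dram_throughput": 81.0, "l2_throughput": 40.2})
print(classify_bound(sm, mem))
```

With the made-up numbers above, memory utilization is high and SM utilization is low, so the kernel falls into the memory bound case.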

Add the option --section ComputeWorkloadAnalysis, or simply use --set full
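
As a rough sketch, the two variants can be assembled into an ncu (Nsight Compute CLI) command line as follows. The helper name `ncu_command`, the report name, and the application path `./my_app` are hypothetical; only the `--section`, `--set` and `-o` options are taken from ncu itself.

```python
import shlex


def ncu_command(app_argv, sections=None, full=False, report="report"):
    """Build an ncu argument list for profiling `app_argv` (a list of strings)."""
    cmd = ["ncu", "-o", report]          # -o: write the results to a report file
    if full:
        cmd += ["--set", "full"]         # collect every section
    else:
        for name in sections or []:
            cmd += ["--section", name]   # collect only the named sections
    return cmd + list(app_argv)


# ./my_app is a placeholder for the program being profiled.
print(shlex.join(ncu_command(["./my_app"], sections=["ComputeWorkloadAnalysis"])))
print(shlex.join(ncu_command(["./my_app"], full=True)))
```

The resulting list can be passed to `subprocess.run` on a machine where ncu is installed.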

Why the hardware units are not busy

Per-cycle data

If an instruction could be issued every cycle, Issued warps would ideally be 1 => check the Warp State analysis

Common reasons a warp is stalled:

Stall reasons

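Stall Long Scoreboard
- Waiting on data read from off-chip memory (global memory)
Stall Short Scoreboard
- Waiting on data from a relatively nearby source (e.g. shared memory, the special function unit (MUFU), dynamic branches)
LG Throttle
- Waiting for the L1 instruction queue: local/global memory instructions are issued too frequently and the queue backs up
- A stall caused by the instruction queue, where a single instruction takes a long time (e.g. global memory reads/writes)
MIO Throttle
- Blocked by memory input/output (MIO) instructions, e.g. LDS, MUFU, dynamic branches
- A stall caused by IO
Stall Not Selected
- The warp was eligible to issue, but the scheduler picked another warp. This suggests the warp count is high and could be reduced somewhat
Stall Memory Throttle
- The warp cannot issue because the LSU pipe is occupied. This suggests gaps in shared memory usage, possibly caused by bank conflicts or warp divergence
Stall No Instruction
- The SMs could not be fed instructions fast enough from memory
- Instruction cache misses => the instruction stream is too varied
  - Warps interfere with each other's execution
  - Separate the prologue, main loop, and epilogue so that instruction execution stays similar and local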
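
The list above can be turned into a small lookup helper for reading a Warp State report. This is a minimal sketch: the short keys (`long_scoreboard`, and so on) and the sample counts are hypothetical names and numbers chosen for illustration, not actual Nsight metric identifiers or measurements.

```python
# Hints condensed from the stall reason list above; keys are made-up short names.
STALL_HINTS = {
    "long_scoreboard": "waiting on global (off-chip) memory reads",
    "short_scoreboard": "waiting on shared memory / MUFU / dynamic branches",
    "lg_throttle": "local/global memory instructions issued too frequently",
    "mio_throttle": "blocked by MIO instructions (LDS, MUFU, dynamic branches)",
    "not_selected": "another warp was picked; warp count could be reduced",
    "memory_throttle": "LSU pipe occupied; check bank conflicts or divergence",
    "no_instruction": "instruction cache misses; instruction stream too varied",
}


def top_stalls(samples, n=2):
    """Return the n largest stall reasons as (name, share, hint) tuples."""
    total = sum(samples.values())
    if total == 0:
        return []
    ranked = sorted(samples.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, count / total, STALL_HINTS.get(name, "no hint"))
            for name, count in ranked[:n]]


# Hypothetical sample counts, made up for illustration only.
samples = {"long_scoreboard": 600, "lg_throttle": 300, "not_selected": 100}
for name, share, hint in top_stalls(samples):
    print(f"{name}: {share:.0%} of samples -> {hint}")
```

Sorting by share keeps attention on the dominant stall first, since fixing a minor stall reason rarely changes the overall picture.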
