Cutlass Tiled Copy
Copy is all you need.
make_tiled_copy语义理解。核心在于: tiler和layout_tv。先说结论: 用atom去对tv layout进行分tile。用tiler去对目标tensor进行分tile。最后将这两个layout组合得到新的tv layout,表示tile-wise的访问tv, v的layout能够保证满足tiler的逻辑切分。
我们想要连续访存, 所有用atom去tilelayout_tv
我们想要分块计算, 所有用tiler去tile目标tensor
当用Tiler进行分tile后, tv中的v的布局将方法改变, 所以最后体现的是一个新的tv layout, 其中的v的layout是tiler后的逻辑layout
快速上手(Atom-wise访问tv):
1 2 3 4 5 6 // (Thr,(FrgV,FrgX),(RestM,RestN,...)) // Thr: The logical threads within the tiled copy. // FrgV: The values local to a COPY_ATOM Dst. // FrgX: The values tiled across COPY_ATOMs Dst. // RestM: The values tiled in M. // RestN: The values tiled in N.
Terms
zipped_divide(layout, tile) = (tile, rest)
layout用tile取切并在tile内根据tile重新映射
right_inverse(layout), 返回一个新layoutresult, 有layout(result(i)) = i
AKA: 如果原始映射会把(0...n)映射成(x0...xn), 求一个映射: 当输入是(x0...xn)时输出是(0...n)
cute中所有的layout都是通过逻辑idx联系起来的, new = right_inverse(a_layout)可以理解成当new的输入的逻辑idx时输出是a_layout
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 void _test_tidfrg_D() { using copy_op = UniversalCopy<cute::uint32_t >; using copy_traits = Copy_Traits<copy_op>; using copy_atom = Copy_Atom<copy_traits, float >; constexpr Tensor dtensor = make_identity_tensor(make_shape(Int<24 >{}, Int<16 >{})); constexpr auto thr_layout = make_layout(make_shape(Int<8 >{}, Int<16 >{})); constexpr auto val_layout = make_layout(make_shape(Int<2 >{}, Int<4 >{})); constexpr int tid = 0 ; auto tiled_copy = make_tiled_copy(copy_atom{}, thr_layout, val_layout); using TiledCopy = decltype (tiled_copy); auto tiler_mn = TiledCopy::Tiler_MN{}; auto tiled_tensor = zipped_divide(dtensor, TiledCopy::Tiler_MN{}); auto ref2trg = right_inverse(TiledCopy::AtomLayoutRef{}).compose(TiledCopy::AtomLayoutDst{}); auto atom_layout_TV = zipped_divide(TiledCopy::TiledLayout_TV{}, make_shape(TiledCopy::AtomNumThr{}, TiledCopy::AtomNumVal{})); auto trg_layout_TV = atom_layout_TV.compose(ref2trg, _); auto _step_zip0 = zip(trg_layout_TV); auto thrval2mn = coalesce(zip(trg_layout_TV), Shape<_1,Shape<_1,_1>>{}); auto tv_tensor = tiled_tensor.compose(thrval2mn, _); auto unfold = tv_tensor(make_coord(_,_), _); }