Basic MoE implementation
in short
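top-k expert selection
linear(dim, num_expert) produces the per-expert routing weights
topk along dim -1
permute: gather the tokens routed to the same expert so they sit next to each other, which makes it easy to run one batch of MLPs (grouped GEMM)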
Trick:
topk_ids.view(num_token, topk).view(-1).argsort() sorts by the top-k expert ids, so entries with the same exp...
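
The argsort trick above can be sketched in a few lines. This is a minimal illustration, not the post's actual code: plain Python lists stand in for tensors so it runs without PyTorch, and the function name `route_and_permute` and the example expert ids are hypothetical.

```python
def route_and_permute(topk_ids):
    """topk_ids: num_token rows, each holding the topk expert ids of one token."""
    topk = len(topk_ids[0])
    # Equivalent of .view(-1): one expert id per (token, slot) pair.
    flat = [expert for row in topk_ids for expert in row]
    # Stable argsort by expert id: slots of the same expert become contiguous.
    order = sorted(range(len(flat)), key=lambda i: flat[i])
    # Flat slot i belongs to token i // topk.
    token_of_slot = [i // topk for i in order]
    # Slots per expert, i.e. the group sizes a grouped GEMM would need.
    counts = {}
    for expert in flat:
        counts[expert] = counts.get(expert, 0) + 1
    return order, token_of_slot, counts


# Hypothetical routing result: 3 tokens, topk = 2, 3 experts.
topk_ids = [[2, 0], [1, 2], [0, 1]]
order, token_of_slot, counts = route_and_permute(topk_ids)
print(order)          # permutation of the flattened (token, slot) pairs
print(token_of_slot)  # which token to gather for each permuted slot
print(counts)         # number of slots handled by each expert
```

Gathering the hidden states with `token_of_slot` gives an input in which each expert's tokens form one contiguous block, and `counts` gives the block sizes. Python's `sorted` is stable, so tokens keep their original order within each expert; with tensors, `argsort` only guarantees this when called with `stable=True`.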
A closer look at Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
abs
Only GPT models were tested
Investigates the mechanism by which ICL (in-context learning) learns from the context
Proposes the "Information Flow with Labels as Anchors" hypothesis...