66Ring's Blog

Intel AVX

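Intel AVX: the Intel® Intrinsics Guide is all you need. SSE: Streaming SIMD Extensions. AVX: Advanced Vector Extensions. Flow (similar to GPU programming): create SIMD registers, load data into the SIMD registers, operate on them with SIMD instructions, store the data in the SIMD registers back to memory. Terms: naming format ...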

2024-11-04

Cheatsheet https://wax8280.github.io/2017/09/22/%E5%85%B3%E4%BA%8E%E5%BA%95%E7%89%87%E7%9A%84%E5%86%B2%E6%B4%97/ https://i50mm.com/skills-%E6%8A%80%E5%B7%A7/%E5%85%B3%E4%BA%8E%E9%BB%91%E7%99%BD%E8%...

2024-10-17

shfl, warp-level primitives

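shfl: warp-level primitives. A warp has 32 threads, and the threads within a warp are called lanes. The lane id is computed as threadid % 32 and the warp id as threadid / 32. Warp shuffle: warp-level primitives can read the register values of other threads in the same warp directly, exchanging data through registers. Every call synchronizes the threads in the warp, sync m...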

2024-08-11
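As a small illustration of the lane id and warp id arithmetic mentioned in the shfl excerpt, here is a minimal Python sketch. The helper names `lane_id` and `warp_id` are hypothetical and only mirror the two formulas; real CUDA code would compute them from `threadIdx` inside a kernel.

```python
WARP_SIZE = 32  # threads per warp, as stated in the excerpt


def lane_id(thread_id: int) -> int:
    """Position of the thread inside its warp (0..31)."""
    return thread_id % WARP_SIZE


def warp_id(thread_id: int) -> int:
    """Index of the warp the thread belongs to (integer division)."""
    return thread_id // WARP_SIZE


# Print (thread id, warp id, lane id) for a few linear thread ids.
for tid in (0, 31, 32, 70):
    print(tid, warp_id(tid), lane_id(tid))
```

Threads 0 and 31 fall in the same warp, while thread 32 starts the next one at lane 0.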

Bank conflicts and conflict resolution

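Bank conflicts and conflict resolution. Bank conflict; 4 bytes per bank; the simple method; ldmatrix; swizzle. To increase parallelism, the GPU lets multiple threads access shared memory at the same time: threads that access different smem banks are served in parallel, while N threads that access the same bank are serialized. This is a bank conflict, specifically an N-way bank conflict. Suppose the GPU has 4 bytes per b...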

2024-08-10
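The bank conflict rule described in the excerpt above can be modeled in a few lines of Python. This is a rough sketch, not a hardware-accurate simulator: the 4-byte bank width comes from the excerpt, while the count of 32 banks and the treatment of same-word accesses as a conflict-free broadcast are assumptions added here. The function names are hypothetical.

```python
from collections import defaultdict

BANK_WIDTH_BYTES = 4  # from the excerpt: 4 bytes per bank
NUM_BANKS = 32        # assumption: not stated in the excerpt


def bank_of(byte_addr: int) -> int:
    """Bank that serves the 4-byte word containing byte_addr."""
    return (byte_addr // BANK_WIDTH_BYTES) % NUM_BANKS


def conflict_degree(byte_addrs) -> int:
    """Largest number of distinct words requested from a single bank.

    1 means conflict-free; N means an N-way bank conflict.
    Threads reading the same word are counted once (assumed broadcast).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addrs:
        words_per_bank[bank_of(addr)].add(addr // BANK_WIDTH_BYTES)
    return max((len(words) for words in words_per_bank.values()), default=0)


# 32 threads reading consecutive 4-byte words: one word per bank.
row_access = [4 * t for t in range(32)]
# 32 threads reading with a stride of 32 words: every word maps to bank 0.
col_access = [4 * 32 * t for t in range(32)]

print(conflict_degree(row_access))
print(conflict_degree(col_access))
```

Under these assumptions the consecutive pattern is conflict-free and the strided pattern is a 32-way conflict, which is the kind of case that padding or swizzling aims to remove.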

Building your own dataset

Abstract
import pandas as pd
import torch
from pathlib import Path
from dataclasses import dataclass
from torch.utils.data import Dataset
from PIL import Image
data_path = Path("p...

2024-07-28

Python setup.py and development workflow

Python setup.py and development workflow: cheat sheet
# setup.py
from setuptools import setup, find_packages
setup(
    name="pip_pkgs_name",  # package name managed by pip, e.g. pip show pip_pkgs_name
    version="0.0.1",
    packages=["pkg...

2024-06-05

Using the Flash attention variable-length batching API

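Using the Flash attention variable-length batching API. This post mainly records how to use the flash_attn.flash_attn_varlen_func interface. The key is understanding the function signature and the input shape: the signature needs the offset of each seq, and the input must be the (bs, seqlen) layout flattened into (total_num, nhead, headdim). from flash_attn import flash_at...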

2024-05-31
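To make the per-sequence offsets from the Flash attention excerpt concrete, the following pure-Python sketch derives them from a list of sequence lengths. The helper `cumulative_offsets` and the example lengths are hypothetical. In practice these offsets would be converted to an int32 tensor and passed to `flash_attn_varlen_func` as `cu_seqlens_q` / `cu_seqlens_k`, which is not shown here.

```python
from itertools import accumulate


def cumulative_offsets(seq_lens):
    """Start offset of every sequence in the flattened batch, plus the total."""
    return [0, *accumulate(seq_lens)]


seq_lens = [3, 5, 2]                    # hypothetical lengths for bs = 3
cu_seqlens = cumulative_offsets(seq_lens)
total_num = cu_seqlens[-1]              # rows of the (total_num, nhead, headdim) input
max_seqlen = max(seq_lens)

# Sequence i occupies rows cu_seqlens[i] : cu_seqlens[i + 1] of the flattened input.
for i, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
    print(i, start, end)
```

The offsets list has bs + 1 entries, so the last entry doubles as total_num.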

Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning, an in-depth analysis

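Understanding "Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning" in depth. abs: only GPT models were tested; the paper investigates the mechanism by which ICL (in-context learning) learns from context; it proposes the "Information Flow with Labels as Anchors" hypothesis...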

2024-05-21

Mixtral MoE source code notes

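Mixtral MoE source code notes. transformers/src/transformers/models/mixtral/modeling_mixtral.py. Note that this is mixtral, not mistral. It is essentially the same as llama; the main difference lies only in the MLP: the mixture-of-experts MLP contains num_experts MLPs, while llama has only one. The core code is MixtralSparseMoeBlock. ...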

2024-05-10

Implementing flash attention with cutlass cute

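Implementing flash attention with cutlass cute. Flash attention, top down (I learned cutlass bottom up, but I think a top-down approach works better for getting started quickly). With cutlass cute, users can conveniently implement certain functionality, namely some common CUDA programming paradigms. CUDA program paradigm: global mem -> shared mem -> reg -> compute blo...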

2024-05-08