CogVideoX model walk-through
shape flow
- transformers
- hidden_states.shape = (batch_size, num_frames, channels, height, width)
- hidden_states = patch_embed(encoder_hidden_states, hidden_states)
- encoder_hidden_states.shape = ()
- text_proj: a projection along the dim dimension
- (bsz, seqlen, dim)
- image patch
- image_embeds.reshape(-1, channels, height, width): merge bsz and num_frames
- image_proj: a conv or linear that folds the trailing spatial dims into the channels dimension, producing (batch, num_frames, height x width, channels)
- (batch, num_frames x height x width, channels)
- encoder_hidden_states.shape = ()
- final embeds = torch.cat([text_embeds, image_embeds], dim=1).contiguous() # [batch, seq_length + num_frames x height x width, channels]
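The patch-embed shape flow above can be sketched end to end. This is a minimal illustration, not the actual diffusers implementation: the dimensions and the `patch` size are made up, and the conv stands in for whatever `image_proj` actually is.

```python
import torch
import torch.nn as nn

# Hypothetical small dimensions, chosen for illustration only.
batch, num_frames, channels, height, width = 2, 4, 16, 8, 8
seq_length, dim, patch = 10, 32, 2

text_embeds = torch.randn(batch, seq_length, dim)       # after text_proj
image = torch.randn(batch, num_frames, channels, height, width)

# Merge batch and frame dims so a 2D conv can patchify every frame at once.
image = image.reshape(-1, channels, height, width)      # (b*f, c, h, w)
proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
image_embeds = proj(image)                              # (b*f, dim, h/p, w/p)
image_embeds = image_embeds.flatten(2).transpose(1, 2)  # (b*f, h*w/p^2, dim)
image_embeds = image_embeds.reshape(batch, -1, dim)     # (b, f*h*w/p^2, dim)

# Text tokens first, then all video patch tokens, as in the concat above.
embeds = torch.cat([text_embeds, image_embeds], dim=1).contiguous()
print(embeds.shape)  # torch.Size([2, 74, 32])
```

The key point is that every frame contributes `(height/patch) x (width/patch)` tokens, so the sequence length grows linearly with num_frames.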
- transformers
pipe
- prompt_embeds.shape
QA
- scheduler responsibilities: prepare the latents, denoise via step and update the latents, track timesteps
- scale_model_input(sample, time): latent preprocessing; may also be a no-op
- step(noise_pred, old_pred_original_sample, timestep, timestep_back, latent): one denoising update
- where encoder_hidden_states comes from
- latent
- review of the denoising flow
- latent encode, text encode
- Denoising loop
- latent_model_input = self.scheduler.scale_model_input(latents, t) # prepare the latent; may be a no-op
- noise_pred = model(latent_model_input, prompt_embeds, t)
- latents, old_pred_original_sample = self.scheduler.step(noise_pred, ...) # denoise, update the latents
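The three calls above form the whole loop. A toy version, where `ToyScheduler` and the lambda `model` are stand-ins (not the real diffusers classes), shows the control flow:

```python
import torch

# Stand-in scheduler: same interface shape as the loop above, toy math inside.
class ToyScheduler:
    timesteps = torch.arange(10, 0, -1)        # denoise for 10 steps

    def scale_model_input(self, sample, t):
        return sample                          # may be a no-op, as noted above

    def step(self, noise_pred, t, latents):
        return latents - 0.1 * noise_pred      # one denoising update

scheduler = ToyScheduler()
latents = torch.randn(1, 4, 16, 8, 8)          # (b, f, c, h, w) latent video
model = lambda x, t: 0.01 * x                  # stand-in for the transformer

for t in scheduler.timesteps:
    latent_model_input = scheduler.scale_model_input(latents, t)
    noise_pred = model(latent_model_input, t)
    latents = scheduler.step(noise_pred, t, latents)
```

The latent shape never changes inside the loop; only its values are refined step by step.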
- how the time and offset embeddings are used:
- inside the norm they pass through a linear and are applied to the hidden states as a scale and a shift
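This scale-and-shift modulation can be sketched as follows. It is a generic adaptive-norm sketch under assumed dimensions, not the exact diffusers module; `temb` stands for the combined time (+ offset) embedding:

```python
import torch
import torch.nn as nn

dim = 32
temb = torch.randn(2, dim)                 # time (+ offset) embedding
hidden = torch.randn(2, 74, dim)           # token sequence from the blocks

# One linear on the activated embedding produces both modulation factors.
linear = nn.Linear(dim, 2 * dim)
scale, shift = linear(nn.SiLU()(temb)).chunk(2, dim=-1)

# Normalize without learned affine, then modulate with scale and shift.
norm = nn.LayerNorm(dim, elementwise_affine=False)
out = norm(hidden) * (1 + scale[:, None]) + shift[:, None]
```

Broadcasting over the sequence dimension means one (batch, dim) embedding modulates every token the same way.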
- relationship between num_frames and timestep; how does num_frames affect generation?
- num_frames allocates latents of shape (bsz, scale(num_frames), num_channels, scale(height), scale(width)) up front
- the image hidden states, i.e. the latents, are created according to num_frames
- finally the text and image hidden states are concatenated, with the image hidden states laid out along the num_frames dimension:
embeds = torch.cat([text_embeds, image_embeds], dim=1).contiguous() # [batch, seq_length + num_frames x height x width, channels]
- timestep plays a role analogous to "gen_len": it controls how many denoising steps run