DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Minghong Cai1 †,
Xiaodong Cun2,
Xiaoyu Li3 ✉,
Wenze Liu1,
Zhaoyang Zhang3,
Yong Zhang4,
Ying Shan3,
Xiangyu Yue1 ✉
1MMLab, The Chinese University of Hong Kong
2GVC Lab, Great Bay University
3ARC Lab, Tencent PCG
4Tencent AI Lab
†: Intern at ARC Lab, Tencent PCG, ✉: Corresponding Authors
- 2024.12.24: Released code and demo on CogVideoX-2B!
Our method naturally extends to single-prompt longer video generation by setting all of the sequential prompts to the same text. This shows that our method can also enhance single-prompt consistency in long video generation.
By removing the latent blending strategy from DiTCtrl, we can achieve Word-Swap video editing in the style of prompt-to-prompt. Specifically, we use only the KV-sharing strategy to share keys and values from the source-prompt branch P_source, so that the synthesized video preserves the original composition while reflecting the content of the new prompt P_target.
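As a minimal sketch of the KV-sharing idea (hypothetical tensor names and a PyTorch-style attention call; not the repository's actual implementation), the edited branch keeps its own queries but attends to keys and values cached from the source-prompt branch:

import torch
import torch.nn.functional as F

def kv_sharing_attention(q_target, k_target, v_target, k_source, v_source, share_kv=True):
    # Sketch: the target (edited) branch keeps its own queries but attends to the
    # keys/values cached from the source-prompt branch, preserving the source
    # video's composition while the new prompt changes the swapped word's content.
    k = k_source if share_kv else k_target
    v = v_source if share_kv else v_target
    return F.scaled_dot_product_attention(q_target, k, v)

In practice, such sharing is typically enabled only at selected denoising steps and attention layers.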
Similar to prompt-to-prompt, by reweighting the attention columns and rows corresponding to a specified token (e.g., "pink") in MM-DiT's text-to-video and video-to-text attention, we can also achieve Reweight-style video editing.
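A rough sketch of this reweighting (illustrative only; the function name and tensor layout below are assumptions, not the repository code): since MM-DiT concatenates text and video tokens into one sequence, scaling the entries that couple a chosen text token with the video tokens strengthens or weakens that word's effect.

import torch

def reweight_token_attention(attn, text_len, token_idx, scale=2.0):
    # attn: joint attention map over the concatenated [text; video] sequence,
    # shape (heads, text_len + video_len, text_len + video_len).
    attn = attn.clone()
    attn[:, text_len:, token_idx] *= scale  # video-to-text column for the token
    attn[:, token_idx, text_len:] *= scale  # text-to-video row for the token
    return attn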
TL;DR: DiTCtrl is the first tuning-free approach based on the MM-DiT architecture for coherent multi-prompt video generation. Our key idea is to treat the multi-prompt video generation task as temporal video editing with smooth transitions.
Full abstract:
Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on a single prompt and struggle to generate coherent scenes with multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. Our key idea is to treat the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. Besides, we present MPVBench, a new benchmark specially designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.
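The smooth transitions mentioned above can be pictured as blending the latents of neighboring prompt segments over their overlapping frames, with weights that ramp from one segment to the next. The snippet below is only a minimal sketch under that assumption, not the exact DiTCtrl schedule:

import torch

def blend_overlap(latent_prev, latent_next, overlap):
    # latent_*: (frames, channels, height, width) latents of two consecutive segments.
    # Linearly ramp the blend weight across the overlapping frames so the
    # transition between prompts is gradual rather than an abrupt cut.
    w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1, 1)
    blended = (1 - w) * latent_prev[-overlap:] + w * latent_next[:overlap]
    return torch.cat([latent_prev[:-overlap], blended, latent_next[overlap:]], dim=0)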
Previous todos:
- Release Code based on CogVideoX-2B
- Release paper on arxiv
- Benchmark metrics and prompts
- Release code of the diffusers version of CogVideoX-2B
- Release code based on CogVideoX-5B
- Release code based on HunyuanVideo
Our method is tested with CUDA 12 on a single A100 or V100.
cd DiTCtrl
conda create -n ditctrl python=3.10
conda activate ditctrl
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
conda install https://anaconda.org/xformers/xformers/0.0.28.post1/download/linux-64/xformers-0.0.28.post1-py310_cu12.1.0_pyt2.4.1.tar.bz2
Our environment is similar to CogVideo's; you may check that repository for more details.
First, download the CogVideoX-2B model weights as follows (these steps are copied from CogVideoX):
cd sat
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
Arrange the model files in the following structure:
CogVideoX-2b-sat/
├── transformer
│   ├── 1000 (or 1)
│   │   └── mp_rank_00_model_states.pt
│   └── latest
└── vae
    └── 3d-vae.pt
Since model weight files are large, it's recommended to use git lfs. See here for git lfs installation.
git lfs install
Next, clone the T5 model, which is used as an encoder and doesn’t require training or fine-tuning.
You may also use the model file location on Modelscope.
git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Download model from Huggingface
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Download from Modelscope
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
This will yield a T5 model in safetensors format that can be loaded without error during Deepspeed fine-tuning.
├── added_tokens.json
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── spiece.model
└── tokenizer_config.json
0 directories, 8 files
Q: I'm getting a safetensors rust.SafetensorError: Error while deserializing header: HeaderTooLarge error. What should I do?
A: This means the T5 model was not downloaded correctly. Please check the size of the t5-v1_1-xxl folder; it should be around 8.9 GB. If it is smaller, the download was likely affected by Hugging Face network issues, and you can switch to the hf-mirror endpoint with the following commands:
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download THUDM/CogVideoX-2b --local-dir ./CogVideoX-2b
Finally, your file structure should be like this:
sat/
├── CogVideoX-2b-sat/
│   ├── transformer
│   └── vae
├── CogVideoX-2b
├── t5-v1_1-xxl
├── configs/
├── inference_case_configs/
├── run_multi_prompt.sh
├── run_single_prompt.sh
├── run_edit_video.sh
├── sample_video.py
├── sample_video_edit.py
├── README.md
├── LICENSE
├── ...
Multi-prompt video generation:
cd sat
bash run_multi_prompt.sh
Single-prompt longer video generation:
cd sat
bash run_single_prompt.sh
Video editing:
cd sat
bash run_edit_video.sh
Take run_multi_prompt.sh as an example:
inference_case_config="inference_case_configs/multi_prompts/rose.yaml"
run_cmd="$environs python sample_video.py --base configs/cogvideox_2b.yaml configs/inference.yaml --custom-config $inference_case_config"
echo ${run_cmd}
eval ${run_cmd}
The custom config is a file in the inference_case_configs folder, where you put your case-specific config files; any values set there override the defaults in configs/inference.yaml.
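Conceptually, the per-case config is merged on top of the base configs, with later values taking precedence. Below is only a rough OmegaConf-style sketch of that override behavior; the repository's actual argument handling may differ:

from omegaconf import OmegaConf

# Base configs provide the defaults; the case-specific config overrides them.
base = OmegaConf.merge(
    OmegaConf.load("configs/cogvideox_2b.yaml"),
    OmegaConf.load("configs/inference.yaml"),
)
case = OmegaConf.load("inference_case_configs/multi_prompts/rose.yaml")
cfg = OmegaConf.merge(base, case)  # values from the case config win
print(cfg.args.prompts)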
Take rose.yaml as an example:
args:
  is_run_isolated: False # If True, will generate the isolated videos not using our method
  seed: 42
  output_dir: outputs/multi_prompt_case/rose # The output directory
  prompts: # Put your prompts here to generate multi-prompt long videos
    - "A gentle close shot of the same rose petal, where the camera gradually pulls back to reveal the entire unfurling bloom in its perfect symmetry."
    - "A steady medium shot of the rose, where the camera continues retreating to show the full stem with its leaves and neighboring buds."
    - "A smooth full shot of the rose bush, where the camera moves further back to encompass the entire garden bed and surrounding flowering plants."
For more details about the custom config, please refer to the configs/inference.yaml file.
When you run the command, it will generate the video in the outputs/multi_prompt_case/rose folder.
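As noted earlier, single-prompt longer videos are simply the multi-prompt case with every prompt set to the same text. A hypothetical helper like the following (the function name and output path are made up for illustration) could write such a case config:

import yaml

def make_single_prompt_case(prompt, num_segments, path):
    # Repeat one prompt so the multi-prompt pipeline produces a longer,
    # consistent single-scene video (see the single-prompt note above).
    cfg = {"args": {"seed": 42,
                    "output_dir": "outputs/single_prompt_case",
                    "prompts": [prompt] * num_segments}}
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)

make_single_prompt_case("A sunflower field swaying in the evening breeze.", 3,
                        "inference_case_configs/my_single_prompt.yaml")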
Single-prompts: please refer to the CogVideoX instructions.
Multi-prompts: first, refer to our example cases in the inference_case_configs/multi_prompts folder for inspiration. We also provide two instruction files in the prompts_gen_instruction folder for generating your own multi-prompts; try both and iterate with the LLM to get the best prompts.
- Presto: Modified from Presto's instruction, focusing on realistic cinematographic sequences with natural camera movements and temporal progression (ideal for documentary-style or realistic scenarios).
- DiTCtrl: Our custom instruction for DiTCtrl, emphasizing creative scene transitions and imaginative scenarios (perfect for artistic and fantasy-based video generation).
@article{cai2024ditctrl,
title = {DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation},
author = {Cai, Minghong and Cun, Xiaodong and Li, Xiaoyu and Liu, Wenze and Zhang, Zhaoyang and Zhang, Yong and Shan, Ying and Yue, Xiangyu},
journal = {arXiv:2412.18597},
year = {2024},
}
Our codebase builds on CogVideoX, MasaCtrl, MimicMotion, FreeNoise, and prompt-to-prompt. Thanks to the authors for sharing their awesome codebases! Thanks also to the concurrent training-based work Presto for providing the scene-description instruction; our first case is inspired by Presto's scene description. Thanks for the great work!
This project is released under the terms of the LICENSE file.