We present TokenFlow, a unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. TokenFlow introduces an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment through a shared mapping mechanism.
TokenFlow excels at both multimodal understanding and image generation. For multimodal understanding, it surpasses flagship models such as LLaVA-1.5 and EMU3 by a large margin. For text-to-image generation, it achieves performance comparable to SDXL at 256×256 resolution.
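The dual-codebook design can be pictured as a quantizer in which a single shared token index addresses both a semantic codebook and a pixel-level codebook, keeping the two feature spaces aligned. Below is a minimal, illustrative sketch of that idea, not the released implementation: the class name, dimensions, and the weighted joint-distance lookup are assumptions made for clarity.

```python
# Illustrative sketch only (not the official TokenFlow code): a dual-codebook
# quantizer where one shared index selects both a semantic code and a pixel code.
import torch
import torch.nn as nn


class DualCodebookQuantizer(nn.Module):
    def __init__(self, num_codes=8192, sem_dim=768, pix_dim=8, w_sem=1.0, w_pix=1.0):
        super().__init__()
        # Two codebooks of the same size; entry i in one is paired with entry i in the other.
        self.sem_codebook = nn.Embedding(num_codes, sem_dim)
        self.pix_codebook = nn.Embedding(num_codes, pix_dim)
        self.w_sem, self.w_pix = w_sem, w_pix

    def forward(self, z_sem, z_pix):
        # z_sem: (N, sem_dim) semantic features; z_pix: (N, pix_dim) pixel features.
        d_sem = torch.cdist(z_sem, self.sem_codebook.weight)  # (N, num_codes)
        d_pix = torch.cdist(z_pix, self.pix_codebook.weight)  # (N, num_codes)
        # Shared mapping: one index minimizes the combined distance, so the same
        # token id retrieves an aligned pair of semantic and pixel codes.
        idx = (self.w_sem * d_sem + self.w_pix * d_pix).argmin(dim=-1)
        return self.sem_codebook(idx), self.pix_codebook(idx), idx
```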
2024.12.9: Code and checkpoints are released.
2024.12.5: TokenFlow is released! See our project page and paper.
See GETTING_STARTED.md for detailed instructions on training and evaluating (1) the TokenFlow tokenizer, (2) the multimodal understanding model, and (3) the text-to-image generation model.
Text-to-Image Model
| Model Size | Tokenizer Weight | Model Weight |
|---|---|---|
| 7B | TokenFlow | TokenFlow-t2i |
Multimodal Understanding Model
| Language Backbone | Tokenizer Weight | Model Weight |
|---|---|---|
| Qwen-2.5-14B | TokenFlow-XL | TokenFlow-llava-qwen2.5-14B-finetuning |
- Release the checkpoints of the tokenizer, text-to-image model, and multimodal understanding model.
- Release the training & inference code for the tokenizer.
- Release the training & inference code for text-to-image generation.
- Release the training & inference code for multimodal understanding.
- Release the single-scale version of TokenFlow.
We thank the authors of VAR, LlamaGen, and LLaVA for their great work.
If our work assists your research, feel free to give us a star ⭐ or cite us using:
@article{qu2024tokenflow,
title={TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation},
author={Qu, Liao and Zhang, Huichao and Liu, Yiheng and Wang, Xu and Jiang, Yi and Gao, Yiming and Ye, Hu and Du, Daniel K and Yuan, Zehuan and Wu, Xinglong},
journal={arXiv preprint arXiv:2412.03069},
year={2024}
}
We are hiring interns and full-time researchers at the ByteFlow Group, ByteDance, with a focus on multimodal understanding and generation. If you are interested, please contact [email protected].