PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation

Rongyao Fang^1*, Chengqi Duan^2*, Kun Wang³, Hao Li^1,4, Hao Tian³, Xingyu Zeng³, Rui Zhao³, Jifeng Dai^4,5, Hongsheng Li^{1 :envelope:}, Xihui Liu^{2 :envelope:}

¹CUHK MMLab, ²HKU MMLab, ³SenseTime, ⁴Shanghai AI Laboratory, ⁵Tsinghua University

*Equal contribution, :envelope:Corresponding authors

Environment Setup

conda create -n puma python==3.8
conda activate puma
pip install -r requirements.txt

Checkpoint Download

# You should first replace the <token> with your huggingface token
python download_ckpt.py

For manual downloads, please download checkpoints from here and put the checkpoints under ./ckpts.

Multi-granular Visual Decoding

python infer_detokenizer.py --num_tokens <chosen number from [1, 4, 16, 64, 256]>

Abstract

PUMA introduces a unified multimodal large language model framework designed to integrate multi-granular visual generation and understanding. Our model excels in a variety of visual tasks, including diverse text-to-image generation, precise image editing, conditional image generation, and visual understanding. It strikes a balance between generation diversity and controllability, making it a versatile tool for visual tasks.

Read the full paper here.

Framework

PUMA leverages multi-granular visual representations as unified inputs and outputs for MLLM, allowing it to handle a variety of visual tasks, including text-to-image generation, image editing, inpainting, colorization, conditional generation, and image understanding.

Multi-granular Semantic Visual Decoding

PUMA's visual decoding process spans five granular image representations (f₀ to f₄) and corresponding decoders (D₀ to D₄), which are trained using SDXL. This allows PUMA to achieve precise image reconstruction and semantic-guided generation, supporting both control and diversity in image generation tasks.

Diverse Text-to-image Generation

Image Editing

Image Conditional Generation

Citation

If you find PUMA useful in your research, please consider citing us:

@article{fang2024puma,
  title     ={PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation},
  author    ={Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu},
  journal   ={arxiv},
  year      ={2024}
}

License

This project is released under the Apache 2.0 license.