---
license: apache-2.0
pipeline_tag: any-to-any
---

# Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie<sup>1</sup>, Zhenheng Yang<sup>2</sup>, Mike Zheng Shou<sup>1</sup>

<sup>1</sup> Show Lab, National University of Singapore &nbsp; <sup>2</sup> ByteDance
## What is new about Show-o2?

We perform unified learning of multimodal understanding and generation on text tokens and the 3D causal VAE latent space, which scales to text, image, and video modalities. A dual path of spatial(-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation. Dedicated heads with autoregressive modeling and flow matching are employed for the overall unified learning of multimodal understanding, image/video generation, and mixed-modality generation.
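To make the dual-path design concrete, here is a minimal, purely illustrative PyTorch sketch. All module names, layer choices, and dimensions below are our own assumptions for exposition, not the released Show-o2 implementation: a shared backbone consumes text tokens plus fused VAE latents, an autoregressive head predicts the next text token, and a flow-matching head predicts velocities for the visual latents.

```python
# Illustrative sketch only -- module names, sizes, and wiring are assumptions,
# not the released Show-o2 implementation.
import torch
import torch.nn as nn


class DualPathFusion(nn.Module):
    """Fuse two views of the 3D causal VAE latents: a semantic path for
    understanding and a low-level spatial(-temporal) path for generation."""

    def __init__(self, latent_dim: int, hidden_dim: int):
        super().__init__()
        self.semantic_proj = nn.Linear(latent_dim, hidden_dim)   # understanding path
        self.lowlevel_proj = nn.Linear(latent_dim, hidden_dim)   # generation path
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, vae_latents: torch.Tensor) -> torch.Tensor:
        # vae_latents: (batch, num_visual_tokens, latent_dim)
        sem = self.semantic_proj(vae_latents)
        low = self.lowlevel_proj(vae_latents)
        return self.fuse(torch.cat([sem, low], dim=-1))


class UnifiedModel(nn.Module):
    """Shared backbone with an autoregressive head for text and a
    flow-matching head predicting velocities for visual latents."""

    def __init__(self, vocab_size=32000, latent_dim=16, hidden_dim=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden_dim)
        self.visual_embed = DualPathFusion(latent_dim, hidden_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.ar_head = nn.Linear(hidden_dim, vocab_size)    # next-token prediction
        self.flow_head = nn.Linear(hidden_dim, latent_dim)  # velocity prediction

    def forward(self, text_ids, vae_latents):
        seq = torch.cat([self.text_embed(text_ids), self.visual_embed(vae_latents)], dim=1)
        hidden = self.backbone(seq)
        n_text = text_ids.shape[1]
        text_logits = self.ar_head(hidden[:, :n_text])   # understanding (text tokens)
        velocity = self.flow_head(hidden[:, n_text:])    # image/video generation
        return text_logits, velocity


# Toy forward pass.
model = UnifiedModel()
logits, vel = model(torch.randint(0, 32000, (1, 8)), torch.randn(1, 64, 16))
print(logits.shape, vel.shape)  # (1, 8, 32000) (1, 64, 16)
```

In training, the text logits would be supervised with next-token prediction and the velocities with a flow-matching objective; the sketch only shows the forward wiring.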
## Pre-trained Model Weights

The Show-o2 checkpoints can be found on Hugging Face:
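If you prefer to fetch a checkpoint programmatically, something along the lines below should work; the repository id here is only a placeholder, so substitute the actual Show-o2 repository name from the list above:

```python
# The repo_id below is a placeholder -- replace it with the actual Show-o2
# repository on Hugging Face.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="<show-o2 repo id>", local_dir="./checkpoints/show-o2")
print(local_path)
```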
## Getting Started

First, set up the environment:

```bash
bash build_env.sh
```
Log in to your wandb account on your machine or server:

```bash
wandb login <your wandb API key>
```

Download the Wan2.1 3D causal VAE model weights here and put them in the current directory.
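As a sketch, assuming the VAE weight is hosted on Hugging Face under the official Wan-AI release (the repo id and filename below are assumptions; adjust them to wherever the checkpoint actually lives), it can be fetched with huggingface_hub:

```python
from huggingface_hub import hf_hub_download

# Assumed repo id and filename for the Wan2.1 3D causal VAE -- verify them
# against the actual release before running.
hf_hub_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",
    filename="Wan2.1_VAE.pth",
    local_dir=".",  # place the weight in the current directory
)
```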
Demo for multimodal understanding; the results are logged to wandb:

```bash
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-jane-pham-727419-1571673.jpg question='Describe the image in detail.'

# The Chinese question below asks: "Please tell me what is written in the image."
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-fotios-photos-2923436.jpg question='请告诉我图片中写着什么?'

python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-taryn-elliott-4144459.jpg question='How many avocados (including the halved) are in this image? Tell me how to make an avocado milkshake in detail.'
```
Demo for text-to-image generation; the results are logged to wandb:

```bash
python3 inference_t2i.py config=configs/showo2_1.5b_demo_1024x1024.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;

python3 inference_t2i.py config=configs/showo2_1.5b_demo_512x512.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;

python3 inference_t2i.py config=configs/showo2_1.5b_demo_432x432.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;

python3 inference_t2i.py config=configs/showo2_7b_demo_432x432.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
```
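For reference, the `guidance_scale` above is typically applied in classifier-free-guidance style at every flow-matching sampling step. The sketch below uses our own function and variable names, not the repository's API:

```python
import torch

def guided_velocity(v_cond: torch.Tensor, v_uncond: torch.Tensor, scale: float) -> torch.Tensor:
    """Classifier-free guidance: move the unconditional velocity toward the
    conditional one by the guidance scale."""
    return v_uncond + scale * (v_cond - v_uncond)

# One Euler step of a flow-matching sampler with `num_inference_steps` steps:
#   dt = 1.0 / num_inference_steps
#   x = x + dt * guided_velocity(v_cond, v_uncond, guidance_scale)
```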
## Citation

To cite the paper and model, please use the following:

```bibtex
@article{xie2025showo2,
  title={Show-o2: Improved Native Unified Multimodal Models},
  author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint},
  year={2025}
}
```
## Acknowledgments

This work is heavily based on Show-o.