---
license: apache-2.0
pipeline_tag: any-to-any
---

# Show-o2: Improved Native Unified Multimodal Models

[Jinheng Xie](https://sierkinhane.github.io/)<sup>1</sup>&nbsp;&nbsp;[Zhenheng Yang](https://scholar.google.com/citations?user=Ds5wwRoAAAAJ&hl=en)<sup>2</sup>&nbsp;&nbsp;[Mike Zheng Shou](https://sites.google.com/view/showlab)<sup>1</sup>

<sup>1</sup> [Show Lab](https://sites.google.com/view/showlab/home?authuser=0), National University of Singapore&nbsp;&nbsp;<sup>2</sup> ByteDance

[![ArXiv](https://img.shields.io/badge/Arxiv-<2506.15564>-.svg)](https://arxiv.org/abs/2506.15564) [![Code](https://img.shields.io/badge/Code--.svg)](https://github.com/showlab/Show-o/tree/main/show-o2) [![WeChat badge](https://img.shields.io/badge/微信-加入-green?logo=wechat&)](https://github.com/showlab/Show-o/blob/main/docs/wechat_qa_3.jpg)
## What is new about Show-o2?

We perform unified learning of multimodal understanding and generation on text tokens and the **3D Causal VAE space**, which scales to **text, image, and video modalities**. A dual path of spatial(-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation. Specific heads with **autoregressive modeling and flow matching** are employed for the overall unified learning of **multimodal understanding, image/video generation, and mixed-modality generation**.

## Pre-trained Model Weights

The Show-o2 checkpoints can be found on Hugging Face:

* [showlab/show-o2-1.5B](https://huggingface.co/showlab/show-o2-1.5B)
* [showlab/show-o2-1.5B-HQ](https://huggingface.co/showlab/show-o2-1.5B-HQ)
* [showlab/show-o2-7B](https://huggingface.co/showlab/show-o2-7B)

## Getting Started

First, set up the environment:

```
bash build_env.sh
```

Log in to your wandb account on your machine or server:

```
wandb login
```

Download the Wan2.1 3D causal VAE model weight [here](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B/blob/main/Wan2.1_VAE.pth) and put it in the current directory (a command-line download sketch is also provided at the end of this page).

Demo for **Multimodal Understanding**; the results can be found on wandb.

```
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
  mmu_image_path=./docs/mmu/pexels-jane-pham-727419-1571673.jpg question='Describe the image in detail.'

# The question below asks, in Chinese: "Please tell me what is written in the image?"
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
  mmu_image_path=./docs/mmu/pexels-fotios-photos-2923436.jpg question='请告诉我图片中写着什么?'

python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
  mmu_image_path=./docs/mmu/pexels-taryn-elliott-4144459.jpg question='How many avocados (including the halved) are in this image? Tell me how to make an avocado milkshake in detail.'
```

Demo for **Text-to-Image Generation**; the results can be found on wandb.

```
python3 inference_t2i.py config=configs/showo2_1.5b_demo_1024x1024.yaml \
  batch_size=4 guidance_scale=7.5 num_inference_steps=50;

python3 inference_t2i.py config=configs/showo2_1.5b_demo_512x512.yaml \
  batch_size=4 guidance_scale=7.5 num_inference_steps=50;

python3 inference_t2i.py config=configs/showo2_1.5b_demo_432x432.yaml \
  batch_size=4 guidance_scale=7.5 num_inference_steps=50;

python3 inference_t2i.py config=configs/showo2_7b_demo_432x432.yaml \
  batch_size=4 guidance_scale=7.5 num_inference_steps=50;
```

### Citation

To cite the paper and model, please use the BibTeX entry below:

```
@article{xie2025showo2,
  title={Show-o2: Improved Native Unified Multimodal Models},
  author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2506.15564},
  year={2025}
}
```

### Acknowledgments

This work is heavily based on [Show-o](https://github.com/showlab/show-o).
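
### Downloading weights from the command line (optional)

Instead of fetching the files through the browser, the checkpoints listed above and the Wan2.1 VAE weight can also be pulled with `huggingface-cli`. This is a minimal sketch, assuming the `huggingface_hub` package (which provides `huggingface-cli`) is available in the environment created by `build_env.sh`; the repository IDs and the `Wan2.1_VAE.pth` filename come from the links above, while the target directories are illustrative choices.

```
# Sketch: download a Show-o2 checkpoint into a local folder.
huggingface-cli download showlab/show-o2-1.5B --local-dir ./show-o2-1.5B

# Sketch: download only the Wan2.1 3D causal VAE weight into the current directory,
# where the inference scripts above expect to find Wan2.1_VAE.pth.
huggingface-cli download Wan-AI/Wan2.1-T2V-14B Wan2.1_VAE.pth --local-dir .
```

If the demo configs expect the checkpoint at a different path, adjust `--local-dir` (or the config) accordingly.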