Improved Native Unified Multimodal Models
[Jinheng Xie](https://sierkinhane.github.io/)¹, [Zhenheng Yang](https://scholar.google.com/citations?user=Ds5wwRoAAAAJ&hl=en)², [Mike Zheng Shou](https://sites.google.com/view/showlab)¹

¹ [Show Lab](https://sites.google.com/view/showlab/home?authuser=0), National University of Singapore &nbsp; ² ByteDance
[arXiv](https://arxiv.org/abs/2506.15564) [Code](https://github.com/showlab/Show-o/tree/main/show-o2) [WeChat QA](https://github.com/showlab/Show-o/blob/main/docs/wechat_qa_3.jpg)
## What is new about Show-o2?
We perform unified learning of multimodal understanding and generation on text tokens and the **3D Causal VAE space**, which scales across **text, image, and video modalities**. A dual-path spatial(-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation. On top of the fused representations, we employ dedicated heads with **autoregressive modeling and flow matching** for the overall unified learning of **multimodal understanding, image/video generation, and mixed-modality generation**.
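As a rough illustration of how the two heads can be trained together, below is a minimal PyTorch sketch, not the released implementation: it combines a next-token cross-entropy loss on text tokens with a rectified-flow velocity-matching loss on visual latents. All module names, tensor shapes, and the linear interpolation schedule are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions for illustration only; the real model sizes differ.
B, T_TXT, T_VIS, D, VOCAB = 2, 16, 64, 256, 1000

lm_head = nn.Linear(D, VOCAB)   # autoregressive head over text tokens
flow_head = nn.Linear(D, D)     # flow-matching head over visual latents

def unified_loss(hidden_text, hidden_vis, text_targets, x0, x1):
    """Combined objective: next-token cross-entropy on text plus a
    rectified-flow (velocity-matching) loss on VAE visual latents."""
    # Autoregressive modeling: predict token t+1 from hidden state t.
    logits = lm_head(hidden_text)                      # [B, T_TXT, VOCAB]
    ar_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, VOCAB),
        text_targets[:, 1:].reshape(-1),
    )
    # Flow matching: if the backbone sees x_t = (1 - t) * x0 + t * x1,
    # the regression target for the predicted velocity is x1 - x0.
    v_pred = flow_head(hidden_vis)                     # [B, T_VIS, D]
    fm_loss = F.mse_loss(v_pred, x1 - x0)
    return ar_loss + fm_loss

# Dummy inputs standing in for backbone outputs and 3D-VAE latents.
hidden_text = torch.randn(B, T_TXT, D)
hidden_vis = torch.randn(B, T_VIS, D)
text_targets = torch.randint(0, VOCAB, (B, T_TXT))
x0, x1 = torch.randn(B, T_VIS, D), torch.randn(B, T_VIS, D)
print(unified_loss(hidden_text, hidden_vis, text_targets, x0, x1))
```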