Update README.md
README.md
CHANGED
@@ -15,7 +15,7 @@
 
 ## What is the new about Show-o2?
 We perform the unified learning of multimodal understanding and generation on the text token and **3D Causal VAE space**, which is scalable for **text, image, and video modalities**. A dual-path of spatial (-temporal) fusion is proposed to accommodate the distinct feature dependency of multimodal understanding and generation. We employ specific heads with **autoregressive modeling and flow matching** for the overall unified learning of **multimodal understanding, image/video and mixed-modality generation.**
-
+<img src="overview.png" width="1000">
 
 ## Pre-trained Model Weigths
 The Show-o2 checkpoints can be found on Hugging Face: