File size: 8,683 Bytes
f304131
 
 
 
7e9bc16
 
 
 
 
f304131
 
 
 
 
 
 
 
 
f4bedc3
 
f304131
 
 
 
 
f4bedc3
 
 
f304131
 
f4bedc3
 
 
 
 
f304131
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f4bedc3
f304131
 
 
 
f4bedc3
f304131
 
 
 
 
 
 
 
 
f4bedc3
f304131
f4bedc3
 
f304131
 
 
 
 
 
f4bedc3
f304131
f4bedc3
 
 
 
 
 
 
 
 
f304131
 
 
f4bedc3
f304131
f4bedc3
 
 
 
 
 
 
 
 
f304131
 
 
 
 
f4bedc3
f304131
f4bedc3
 
 
 
 
f304131
 
 
f4bedc3
f304131
f4bedc3
 
 
 
 
 
 
 
 
 
 
 
 
f304131
 
 
f4bedc3
f304131
f4bedc3
 
 
 
 
 
 
 
 
 
 
f304131
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
---
license: apache-2.0
language:
- en
datasets:
- AIDC-AI/Ovis-dataset
base_model:
- AIDC-AI/Ovis2-2B
pipeline_tag: any-to-any
---

# Ovis-U1

<div align="center">
  <img src=https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/3IK823BZ8w-mz_QfeYkDn.png width="30%"/>
</div>

<p align="center">
  <a href="https://github.com/AIDC-AI/Ovis-U1/blob/main/docs/Ovis_U1_Report.pdf"><img src="https://img.shields.io/badge/Paper-Tech_Report-b31b1b" alt="paper"></a>
  <a href="https://github.com/AIDC-AI/Ovis-U1"><img src="https://img.shields.io/badge/GitHub-AIDC--AI/Ovis--U1-blue?style=flat&logo=github" alt="code"></a>
  <a href="https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B"><img src="https://img.shields.io/badge/🎨_HF_Spaces-AIDC--AI/Ovis--U1--3B-lightblack" alt="demo"></a>
  <a href="https://huggingface.co/AIDC-AI/Ovis-U1-3B"><img src="https://img.shields.io/badge/πŸ€—_Model-AIDC--AI/Ovis--U1--3B-yellow" alt="model"></a>
</p>


<p align="left">
Building on the foundation of the Ovis series, Ovis-U1 is a 3-billion-parameter unified model that  seamlessly integrates <b>multimodal understanding</b>, <b>text-to-image generation</b>, and <b>image editing</b> within a single powerful framework. 
</p>


<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/636f4c6b5d2050767e4a1491/EmEEGmot9JzaBfHP2uWld.jpeg" width="95%">
  <br>
  <em>The overall architecture of Ovis-U1 (cf. Fig.2 in our report).</em>
</p>


## πŸ“¦ Installation

Ovis-U1 has been tested with Python 3.10, Torch 2.4.0, Transformers 4.51.3, and DeepSpeed 0.15.4. For a comprehensive list of package dependencies, please consult the requirements.txt file.

```bash
git clone [email protected]:AIDC-AI/Ovis-U1.git
conda create -n ovis-u1 python=3.10 -y
conda activate ovis-u1
cd Ovis-U1
pip install -r requirements.txt
pip install -e .

```


## πŸ› οΈ Inference

For multimodal understanding, please run

```bash
python test_img_to_txt.py
```

For text-to-image, please run
```bash
python test_txt_to_img.py \
    --height 1024 \
    --width 1024  \
    --steps 50 \
    --seed 42 \
    --txt_cfg 5  
```

For image editing, please run
```bash
python test_img_edit.py \
    --steps 50 \
    --img_cfg 1.5 \
    --txt_cfg 6  
```

## πŸ“Š Performance

#### OpenCompass Multi-modal Academic Benchmarks

| Model | Avg | MMB | MMS | MMMU | MathVista | Hallusion | AI2D | OCRBench | MMVet | 
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| GPT-4o | **75.4** | **86**  |**70.2** | **72.9** | **71.6** | **57** | **86.3** | 82.2 | **76.9** | 
| InternVL2.5-2B | 59.9 | 70.9 | 54.3 | 43.2 | 51.1 | 42.3 | 74.9 | 80.2 | 62.6 | 
| SAIL-VL-2B | 61 | 73.7 |56.5 | 44.1 | 62.8 | 45.9 | 77.4 | 83.1 | 44.2 | 
| InternVL3-2B | 61.1 | 78 |61.1 | 48.7 | 57.6 | 41.9 | 78.6 | 83.1 | <ins>67</ins> | 
| Qwen2.5-VL-3B | 64.5 | 76.8 | 56.3 | 51.2 | 61.2 | 46.6 | 81.4 | 82.8 | 60 | 
| Ovis2-2B | 65.2 | 76.9 | 56.7 | 45.6 | 64.1 | 50.2 | 82.7 | 87.3 | 58.3 | 
| SAIL-VL-1.5-2B | 67  | 78.5 | 62.6 | 46.4 | 67 | 50 | 83.7 | **89.1** | 58.8 | 
| Ristretto-3B | 67.7 | <ins>80.2</ins> | <ins>62.8</ins> | <ins>51.3</ins> | 67.6 | 50.2 | 84.2 | 84.7 | 60.7 | 
| Ovis-U1 |  <ins>69.6</ins>  | 77.8 |61.3 | 51.1 | <ins>69.4</ins> | <ins>56.3</ins> | <ins>85.6</ins> |  <ins>88.3</ins> | 66.7 | 

#### GenEval

| Model | Overall |Single object | Two object | Counting | Colors | Position | Attribute binding | 
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| GPT-4o | 0.84 | <ins>0.99</ins> | 0.92 | <ins>0.85</ins> | 0.92 | 0.75 | 0.61 | 
| BAGEL | 0.82  | <ins>0.99</ins> | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 
| BAGEL πŸ“ | <ins>0.88</ins> | 0.98 | 0.95 | 0.84 | <ins>0.95</ins> | <ins>0.78</ins> | **0.77** |
| UniWorld-V1 | 0.80 | <ins>0.99</ins> | 0.93 | 0.79 | 0.89 | 0.49 | 0.70 |
| UniWorld-V1 πŸ“ | 0.84 | 0.98 | 0.93 | 0.81 | 0.89 | 0.74 | 0.71 | 
| OmniGen | 0.68 |  0.98 | 0.84 | 0.66 | 0.74 | 0.40 | 0.43 | 
| OmniGen2 |0.80 |  **1** | 0.95 | 0.64 | 0.88 | 0.55 | <ins>0.76</ins> | 
| OmniGen2 πŸ“ | 0.86 | <ins>0.99</ins> | <ins>0.96</ins> | 0.74 | **0.98** | 0.71 | 0.75 | 
| Ovis-U1 |**0.89** |  0.98 | **0.98** | **0.90** | 0.92 | **0.79** | 0.75 | 

*πŸ“ denotes using the rewritten prompts*

#### DPG-Bench

| Model | Overall | Global | Entity | Attribute | Relation | Other | 
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| BAGEL | **85.07** | **88.94** | **90.37** | **91.29** | <ins>90.82</ins> | <ins>88.67</ins> | 
| UniWorld-V1 |81.38 |  83.64 | 88.39 | 88.44 | 89.27 | 87.22 | 
| OmniGen |81.16 | 87.90 | 88.97 | 88.47 | 87.95 | 83.56 | 
| OmniGen2 |83.57 | <ins>88.81</ins> | 88.83 | <ins>90.18</ins> | 89.37 | **90.27** | 
| Ovis-U1 | <ins>83.72</ins> | 82.37 | <ins>90.08</ins> | 88.68 | **93.35** | 85.20 |

#### ImgEdit-Bench

| Model | Overall |Add | Adjust | Extract | Replace | Remove | Background | Style | Hybrid | Action | 
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| GPT-4o | **4.2** | **4.61** | **4.33** | <ins>2.9</ins> | <ins>4.35</ins> | <ins>3.66</ins> | **4.57** | **4.93** | **3.96** | **4.89** | 
| MagicBrush | 1.90 | 2.84 | 1.58 | 1.51 | 1.97 | 1.58 | 1.75 | 2.38 | 1.62 | 1.22 | 
| Instruct-P2P | 1.88 | 2.45 | 1.83 | 1.44 | 2.01 | 1.50 | 1.44 | 3.55 | 1.2 | 1.46 | 
| AnyEdit | 2.45 | 3.18 | 2.95 | 1.88 | 2.47 | 2.23 | 2.24 | 2.85 | 1.56 | 2.65 | 
| UltraEdit |2.7 | 3.44 | 2.81 | 2.13 | 2.96 | 1.45 | 2.83 | 3.76 | 1.91 | 2.98 | 
| OmniGen |  2.96 | 3.47 | 3.04 | 1.71 | 2.94 | 2.43 | 3.21 | 4.19 | 2.24 | 3.38 |
| Step1X-Edit |3.06 |  3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 | 
| ICEdit |3.05 | 3.58 | 3.39 | 1.73 | 3.15 | 2.93 | 3.08 | 3.84 | 2.04 | 3.68 | 
| BAGEL |3.2 | 3.56 | 3.31 | 1.7 | 3.3 | 2.62 | 3.24 | 4.49 | 2.38 | 4.17 | 
| UniWorld-V1 |3.26 | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 | 
| OmniGen2 | 3.44 | 3.57 | 3.06 | 1.77 | 3.74 | 3.2 | 3.57 | <ins>4.81</ins> | 2.52 | <ins>4.68</ins> |
| Ovis-U1 |<ins>4.00</ins> | <ins>4.13</ins> | <ins>3.62</ins> | **2.98** | **4.45** | **4.06** | <ins>4.22</ins> | 4.69 | <ins>3.45</ins> | 4.61 | 


#### GEdit-Bench-EN

|  Model | Avg | Background Change | Color Alteration   | Material Modification  | Motion Change | Portrait Beautification  | Style Transfer  | Subject Addition  | Subject Removal  | Subject Replacement  | Text Modification  | Tone Transformation  | 
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| GPT-4o |**7.534** | 7.205 |	6.491 |	**6.607** | **8.096** |	**7.768** |	<ins>6.961</ins> |	7.622 |	**8.331** |	**8.067** |	**7.427** |	**8.301** |	
| AnyEdit | 3.212 | 4.663	| 4.260 |	2.537 |	2.024 |	3.479	| 2.032 |	3.995 |	3.089 |	3.180 |	0.922 |	5.151 |	
| Instruct-Pix2Pix | 	3.684 | 3.825 |	5.182 |	3.688 |	3.509 |	4.339 |	4.560 |	3.461 |	2.031 |	4.237 |	0.955 |	4.733 |
| MagicBrush |4.518 |	5.637 |	5.136 |	5.078 |	4.513 |	4.487 |	4.439 |	5.252 |	3.704 |	4.941 |	1.384 |	5.130 |	
| OmniGen | 5.062 | 5.281 |	6.003 |	5.308 |	2.916 |	3.087 |	4.903 |	6.628 |	6.352 |	5.616 |	4.519 |	5.064 |	
| Gemini |6.315 | 	6.781 |	6.369 |	6.040 |	6.938 |	5.591 |	4.676 |	7.501 |	6.447 |	7.003 |	5.765 |	6.350 |	
| Step1X-Edit |	6.701 | 6.547 |	6.545 |	6.204 |	6.483 |	6.787 |	**7.221** |	6.975 |	6.512 |	7.068 |	<ins>6.921</ins> |	6.448 |	
| Doubao |<ins>6.754</ins> | 	<ins>7.430</ins> |	**7.095** |	6.339 |	<ins>6.973</ins> |	<ins>6.972</ins> |	6.767 |	<ins>7.674</ins> |	6.748 |	<ins>7.447</ins> |	3.471 |	<ins>7.383</ins> |	
| BAGEL | 6.519 | 7.324 |	<ins>6.909</ins> |	<ins>6.381</ins> |	4.753 |	4.573 |	6.150 |	**7.896** |	7.164 |	7.021 |	7.320 |	6.218 |	
| Ovis-U1 |6.420 | **7.486** |	6.879 |	6.208 |	4.790 |	5.981 |	6.463 |	7.491 |	<ins>7.254</ins> |	7.266 |	4.482 |	6.314 |	


## πŸ“š Citation

If you find Ovis-U1 useful, please cite our paper:

```bibtex
@inproceedings{wang2025ovisu1,
title={Ovis-U1 Technical Report},
author={Ovis Team},
year={2025}
}
```

## πŸ™ Acknowledgments

The code is built upon [Ovis](https://github.com/AIDC-AI/Ovis) and [FLUX](https://github.com/black-forest-labs/flux).

## πŸ“„ License

The project is released under Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0, SPDX-License-identifier: Apache-2.0).

## 🚨 Disclaimer

We used compliance checking algorithms during the training process, to ensure the compliance of the trained model to the best of our ability. Due to complex data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.