Ovis-U1-3B / README.md
Flourish's picture
Upload folder using huggingface_hub
f304131 verified
|
raw
history blame
8.47 kB
metadata
license: apache-2.0
language:
  - en

Ovis-U1

demo demo model

Building on the foundation of the Ovis series, Ovis-U1 is a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities.

Ovis-U1 architecture
The overall architecture of Ovis-U1 (cf. Fig.2 in our report).

πŸš€ News

  • [2025/6/28] πŸ”₯ Announcing Ovis-U1-3B (Model, Demo)!

πŸ“¦ Installation

Ovis-U1 has been tested with Python 3.10, Torch 2.4.0, Transformers 4.51.3, and DeepSpeed 0.15.4. For a comprehensive list of package dependencies, please consult the requirements.txt file.

git clone [email protected]:AIDC-AI/Ovis-U1.git
conda create -n ovis-u1 python=3.10 -y
conda activate ovis-u1
cd Ovis-U1
pip install -r requirements.txt
pip install -e .

πŸ“‚ Model Checkpoints

We provide pretrained Ovis-U1-3B checkpoints for easy download and evaluation:

  • Model Repository: Hugging Face

πŸ› οΈ Inference

For multimodal understanding, please run

python ovis/eval/test_txt_generation.py

For text-to-image, please run

python ovis/eval/test_t2i.py \
    --height 1024 \
    --width 1024  \
    --steps 50 \
    --seed 42 \
    --txt_cfg 5  

For image editing, please run

python ovis/eval/test_img_edit.py \
    --steps 50 \
    --img_cfg 4 \
    --txt_cfg 7.5  

πŸ“Š Performance

OpenCompass Multi-modal Academic Benchmarks

Model MMB MMS MMMU MathVista Hallusion AI2D OCRBench MMVet Avg
GPT-4o 86 70.2 72.9 71.6 57 86.3 82.2 76.9 75.4
InternVL2.5-2B 70.9 54.3 43.2 51.1 42.3 74.9 80.2 62.6 59.9
SAIL-VL-2B 73.7 56.5 44.1 62.8 45.9 77.4 83.1 44.2 61
InternVL3-2B 78 61.1 48.7 57.6 41.9 78.6 83.1 67 61.1
Qwen2.5-VL-3B 76.8 56.3 51.2 61.2 46.6 81.4 82.8 60 64.5
Ovis2-2B 76.9 56.7 45.6 64.1 50.2 82.7 87.3 58.3 65.2
SAIL-VL-1.5-2B 78.5 62.6 46.4 67 50 83.7 89.1 58.8 67
Ristretto-3B 80.2 62.8 51.3 67.6 50.2 84.2 84.7 60.7 67.7
Ovis-U1 77.8 61.3 51.1 69.4 56.3 85.6 88.3 66.7 69.6

GenEval

Model Single object Two object Counting Colors Position Attribute binding Overall
GPT-4o 0.99 0.92 0.85 0.92 0.75 0.61 0.84
BAGEL 0.99 0.94 0.81 0.88 0.64 0.63 0.82
BAGEL πŸ“ 0.98 0.95 0.84 0.95 0.78 0.77 0.88
UniWorld-V1 0.99 0.93 0.79 0.89 0.49 0.70 0.80
UniWorld-V1 πŸ“ 0.98 0.93 0.81 0.89 0.74 0.71 0.84
OmniGen 0.98 0.84 0.66 0.74 0.40 0.43 0.68
OmniGen2 1 0.95 0.64 0.88 0.55 0.76 0.80
OmniGen2 πŸ“ 0.99 0.96 0.74 0.98 0.71 0.75 0.86
Ovis-U1 0.98 0.98 0.90 0.92 0.79 0.75 0.89

πŸ“ denotes using the rewritten prompts

DPG-Bench

Model Global Entity Attribute Relation Other Overall
BAGEL 88.94 90.37 91.29 90.82 88.67 85.07
UniWorld-V1 83.64 88.39 88.44 89.27 87.22 81.38
OmniGen 87.90 88.97 88.47 87.95 83.56 81.16
OmniGen2 88.81 88.83 90.18 89.37 90.27 83.57
Ovis-U1 82.37 90.08 88.68 93.35 85.20 83.72

ImgEdit-Bench

Model Add Adjust Extract Replace Remove Background Style Hybrid Action Overall
GPT-4o 4.61 4.33 2.9 4.35 3.66 4.57 4.93 3.96 4.89 4.2
MagicBrush 2.84 1.58 1.51 1.97 1.58 1.75 2.38 1.62 1.22 1.90
Instruct-P2P 2.45 1.83 1.44 2.01 1.50 1.44 3.55 1.2 1.46 1.88
AnyEdit 3.18 2.95 1.88 2.47 2.23 2.24 2.85 1.56 2.65 2.45
UltraEdit 3.44 2.81 2.13 2.96 1.45 2.83 3.76 1.91 2.98 2.7
OmniGen 3.47 3.04 1.71 2.94 2.43 3.21 4.19 2.24 3.38 2.96
Step1X-Edit 3.88 3.14 1.76 3.40 2.41 3.16 4.63 2.64 2.52 3.06
ICEdit 3.58 3.39 1.73 3.15 2.93 3.08 3.84 2.04 3.68 3.05
BAGEL 3.56 3.31 1.7 3.3 2.62 3.24 4.49 2.38 4.17 3.2
UniWorld-V1 3.82 3.64 2.27 3.47 3.24 2.99 4.21 2.96 2.74 3.26
OmniGen2 3.57 3.06 1.77 3.74 3.2 3.57 4.81 2.52 4.68 3.44
Ovis-U1 4.13 3.62 2.98 4.45 4.06 4.22 4.69 3.45 4.61 4.00

GEdit-Bench-EN

Model Background Change Color Alteration Material Modification Motion Change Portrait Beautification Style Transfer Subject Addition Subject Removal Subject Replacement Text Modification Tone Transformation Avg
GPT-4o 7.205 6.491 6.607 8.096 7.768 6.961 7.622 8.331 8.067 7.427 8.301 7.534
AnyEdit 4.663 4.260 2.537 2.024 3.479 2.032 3.995 3.089 3.180 0.922 5.151 3.212
Instruct-Pix2Pix 3.825 5.182 3.688 3.509 4.339 4.560 3.461 2.031 4.237 0.955 4.733 3.684
MagicBrush 5.637 5.136 5.078 4.513 4.487 4.439 5.252 3.704 4.941 1.384 5.130 4.518
OmniGen 5.281 6.003 5.308 2.916 3.087 4.903 6.628 6.352 5.616 4.519 5.064 5.062
Gemini 6.781 6.369 6.040 6.938 5.591 4.676 7.501 6.447 7.003 5.765 6.350 6.315
Step1X-Edit 6.547 6.545 6.204 6.483 6.787 7.221 6.975 6.512 7.068 6.921 6.448 6.701
Doubao 7.430 7.095 6.339 6.973 6.972 6.767 7.674 6.748 7.447 3.471 7.383 6.754
BAGEL 7.324 6.909 6.381 4.753 4.573 6.150 7.896 7.164 7.021 7.320 6.218 6.519
Ovis-U1 7.486 6.879 6.208 4.790 5.981 6.463 7.491 7.254 7.266 4.482 6.314 6.420

πŸ“š Citation

If you find Ovis-U1 useful, please cite our paper:

@inproceedings{wang2025ovisu1,
title={Ovis-U1 Technical Report},
author={Ovis Team},
year={2025}
}

πŸ™ Acknowledgments

The code is built upon Ovis and FLUX.

πŸ“„ License

The project is released under Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0, SPDX-License-identifier: Apache-2.0).

🚨 Disclaimer

We used compliance checking algorithms during the training process, to ensure the compliance of the trained model to the best of our ability. Due to complex data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.