BabyChou committed
Commit c4c8050 · verified · 1 Parent(s): 92c8e0c

Update README.md

Files changed (1): README.md (+18 -199)
README.md CHANGED
@@ -2,104 +2,11 @@
  license: other
  license_name: yi-license
  license_link: LICENSE
- library_name: pytorch
  ---

- <div align="center">
-
- <picture>
- <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_dark.svg" width="200px">
- <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_light.svg" width="200px">
- <img alt="specify theme context for images" src="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_light.svg" width="200px">
- </picture>
-
- </div>
-
- <div align="center">
- <h1 align="center">Yi Vision Language Model</h1>
- </div>
-
-
- <div align="center">
- <h3 align="center">Better Bilingual Multimodal Model</h3>
- </div>
-
- <p align="center">
- 🤗 <a href="https://huggingface.co/01-ai" target="_blank">Hugging Face</a> • 🤖 <a href="https://www.modelscope.cn/organization/01ai/" target="_blank">ModelScope</a> • ✡️ <a href="https://wisemodel.cn/organization/01.AI" target="_blank">WiseModel</a>
- </p>
-
- <p align="center">
- 👩‍🚀 Ask questions or discuss ideas on <a href="https://github.com/01-ai/Yi/discussions" target="_blank">GitHub</a>!
- </p>
-
- <p align="center">
- 👋 Join us 💬 on <a href="https://github.com/01-ai/Yi/issues/43#issuecomment-1827285245" target="_blank">WeChat (Chinese)</a>!
- </p>
-
- <p align="center">
- 📚 Grow at the <a href="https://github.com/01-ai/Yi/blob/main/docs/learning_hub.md">Yi Learning Hub</a>!
- </p>
-
- <hr>
-
- <!-- DO NOT REMOVE ME -->
-
- <details open>
- <summary><b>📕 Table of Contents</b></summary>
-
- - [What is Yi-VL?](#what-is-yi-vl)
-   - [Overview](#overview)
-   - [Models](#models)
-   - [Features](#features)
-   - [Architecture](#architecture)
-   - [Training](#training)
-   - [Limitations](#limitations)
- - [Why Yi-VL?](#why-yi-vl)
-   - [Benchmarks](#benchmarks)
-   - [Showcases](#showcases)
- - [How to use Yi-VL?](#how-to-use-yi-vl)
-   - [Quick start](#quick-start)
-   - [Hardware requirements](#hardware-requirements)
- - [Misc.](#misc)
-   - [Acknowledgements and attributions](#acknowledgements-and-attributions)
-     - [List of used open-source projects](#list-of-used-open-source-projects)
-   - [License](#license)
-
- </details>
-
- <hr>

  # What is Yi-VL?

- ## Overview
-
- - **Yi Vision Language (Yi-VL)** model is the open-source, multimodal version of the Yi **Large Language Model (LLM)** series, enabling content comprehension, recognition, and multi-round conversations about images.
-
- - Yi-VL demonstrates exceptional performance, **ranking first** among all existing open-source models in the latest benchmarks, including [MMMU](https://mmmu-benchmark.github.io/#leaderboard) in English and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard) in Chinese (based on data available up to January 2024).
-
- - Yi-VL-34B is the **first** open-source 34B vision language model worldwide.
-
- ## Models
-
- Yi-VL has released the following versions.
-
- Model | Download
- |---|---
- Yi-VL-34B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-34B) • [🤖 ModelScope](https://www.modelscope.cn/models/01ai/Yi-VL-34B/summary)
- Yi-VL-6B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-6B) • [🤖 ModelScope](https://www.modelscope.cn/models/01ai/Yi-VL-6B/summary)
-
- ## Features
-
- Yi-VL offers the following features:
-
- - Multi-round text-image conversations: Yi-VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.
-
- - Bilingual text support: Yi-VL supports conversations in both English and Chinese, including text recognition in images.
-
- - Strong image comprehension: Yi-VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.
-
- - Fine-grained image resolution: Yi-VL supports image understanding at a higher resolution of 448&times;448.
-
  ## Architecture

  Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of three primary components:
@@ -112,121 +19,33 @@ Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of three primary components:

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/EGVHSWG4kAcX01xDaoeXS.png)

- ## Training
-
- ### Training process
-
- Yi-VL is trained to align visual information well to the semantic space of Yi LLM, which undergoes a comprehensive three-stage training process:
-
- - Stage 1: The parameters of ViT and the projection module are trained using an image resolution of 224&times;224. The LLM weights are frozen. The training leverages an image caption dataset comprising 100 million image-text pairs from [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/). The primary objective is to enhance the ViT's knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.
-
- - Stage 2: The image resolution of ViT is scaled up to 448&times;448, and the parameters of ViT and the projection module are trained. It aims to further boost the model's capability for discerning intricate visual details. The dataset used in this stage includes about 25 million image-text pairs, such as [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/), [CLLaVA](https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions), [LLaVAR](https://llavar.github.io/), [Flickr](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset), [VQAv2](https://paperswithcode.com/dataset/visual-question-answering-v2-0), [RefCOCO](https://github.com/lichengunc/refer/tree/master), [Visual7w](http://ai.stanford.edu/~yukez/visual7w/) and so on.
-
- - Stage 3: The parameters of the entire model (that is, ViT, projection module, and LLM) are trained. The primary goal is to enhance the model's proficiency in multimodal chat interactions, thereby endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. To this end, the training dataset encompasses a diverse range of sources, totalling approximately 1 million image-text pairs, including [GQA](https://cs.stanford.edu/people/dorarad/gqa/download.html), [VizWiz VQA](https://vizwiz.org/tasks-and-datasets/vqa/), [TextCaps](https://opendatalab.com/OpenDataLab/TextCaps), [OCR-VQA](https://ocr-vqa.github.io/), [Visual Genome](https://homes.cs.washington.edu/~ranjay/visualgenome/api.html), [LAION GPT4V](https://huggingface.co/datasets/laion/gpt4v-dataset) and so on. To ensure data balancing, we impose a cap on the maximum data contribution from any single source, restricting it to no more than 50,000 pairs.
-
- Below are the parameters configured for each stage.
-
- Stage | Global batch size | Learning rate | Gradient clip | Epochs
- |---|---|---|---|---
- Stage 1, 2 | 4096 | 1e-4 | 0.5 | 1
- Stage 3 | 256 | 2e-5 | 1.0 | 2
-
- ### Training resource consumption
-
- - The training consumes 128 NVIDIA A800 (80G) GPUs.
-
- - The total training time amounted to approximately 10 days for Yi-VL-34B and 3 days for Yi-VL-6B.
-
- ## Limitations
-
- This is the initial release of the Yi-VL, which comes with some known limitations. It is recommended to carefully evaluate potential risks before adopting any models.
-
- - Feature limitation
-
-   - Visual question answering is supported. Other features like text-to-3D and image-to-video are not yet supported.
-
-   - A single image rather than several images can be accepted as an input.
-
- - Hallucination problem
-
-   - There is a certain possibility of generating content that does not exist in the image.
-
-   - In scenes containing multiple objects, some objects might be incorrectly identified or described with insufficient detail.
-
- - Resolution issue
-
-   - Yi-VL is trained on images with a resolution of 448&times;448. During inference, inputs of any resolution are resized to 448&times;448. Low-resolution images may result in information loss, and more fine-grained images (above 448) do not bring in extra knowledge.
-
- - Other limitations of the Yi LLM.
-
- # Why Yi-VL?
-
- ## Benchmarks
-
- Yi-VL outperforms all existing open-source models in [MMMU](https://mmmu-benchmark.github.io) and [CMMMU](https://cmmmu-benchmark.github.io), two advanced benchmarks that include massive multi-discipline multimodal questions (based on data available up to January 2024).
-
- - MMMU
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/kCmXuwLbLvequ93kjh3mg.png)
-
- - CMMMU
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/6YuSakMCg3D2AozixdoZ0.png)
-
- ## Showcases
-
- Below are some representative examples of detailed description and visual question answering, showcasing the capabilities of Yi-VL.
-
- - English
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64cc65d786d8dc0caa6ab3cd/F_2bIVwMtVamygbVqtb8E.png)
-
- - Chinese
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/l_tLzugFtHk1dkVsFJE7B.png)

  # How to use Yi-VL?

  ## Quick start

- Please refer to [Yi GitHub Repo](https://github.com/01-ai/Yi/tree/main/VL) for details.
-
- ## Hardware requirements
-
- For model inference, the recommended GPU examples are:
-
- - Yi-VL-6B: RTX 3090, RTX 4090, A10, A30
-
- - Yi-VL-34B: 4 &times; RTX 4090, A800 (80 GB)
-
- # Misc.
-
- ## Acknowledgements and attributions
-
- This project makes use of open-source software/components. We acknowledge and are grateful to these developers for their contributions to the open-source community.
-
- ### List of used open-source projects
-
- 1. LLaVA
-    - Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yuheng Li, and Yong Jae Lee
-    - Source: https://github.com/haotian-liu/LLaVA
-    - License: Apache-2.0 license
-    - Description: The codebase is based on LLaVA code.
-
- 2. OpenClip
-    - Authors: Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt
-    - Source: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
-    - License: MIT
-    - Description: The ViT is initialized using the weights of OpenClip.
-
- **Notes**
-
- - This attribution does not claim to cover all open-source components used. Please check individual components and their respective licenses for full details.
-
- - The use of the open-source components is subject to the terms and conditions of the respective licenses.
-
- We appreciate the open-source community for their invaluable contributions to the technology world.

  ## License
@@ -236,4 +55,4 @@ The Yi series models are fully open for academic research and free for commercia

  All usage must adhere to the [Yi Series Models Community License Agreement 2.1](https://huggingface.co/01-ai/Yi-VL-34B/blob/main/LICENSE).

- For free commercial use, you only need to send an email to get official commercial permission.
  license: other
  license_name: yi-license
  license_link: LICENSE
  ---

  # What is Yi-VL?

  ## Architecture

  Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of three primary components:

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/EGVHSWG4kAcX01xDaoeXS.png)
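In LLaVA-style architectures like this one, a small projection module maps the vision transformer's patch features into the LLM's embedding space so that image patches can be consumed as ordinary input tokens. A minimal numpy sketch of that mapping; the dimensions, initialization scale, and ReLU activation here are illustrative assumptions, not Yi-VL's actual configuration:

```python
import numpy as np

# Hypothetical, downscaled dimensions for illustration only;
# the real Yi-VL ViT/LLM widths are far larger.
vit_dim, llm_dim, num_patches = 64, 128, 16

rng = np.random.default_rng(0)

# Two-layer MLP projection (LLaVA-style connector between ViT and LLM)
W1 = rng.standard_normal((vit_dim, llm_dim)) * 0.02
b1 = np.zeros(llm_dim)
W2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02
b2 = np.zeros(llm_dim)

def project(vit_features: np.ndarray) -> np.ndarray:
    """Map (num_patches, vit_dim) ViT features to (num_patches, llm_dim) tokens."""
    hidden = np.maximum(vit_features @ W1 + b1, 0.0)  # ReLU stand-in for the real activation
    return hidden @ W2 + b2

patch_features = rng.standard_normal((num_patches, vit_dim))
image_tokens = project(patch_features)
print(image_tokens.shape)  # (16, 128): one LLM-space token per image patch
```

The projected `image_tokens` are what the LLM sees in place of the raw image, concatenated with the text tokens of the conversation.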

  # How to use Yi-VL?

  ## Quick start

+ This model has been implemented in the SGLang codebase; you can call it by defining a function like so:
+
+ ```python
+ import sglang as sgl
+
+ @sgl.function
+ def image_qa(s, image_path, question):
+     s += sgl.user(sgl.image(image_path) + question)
+     s += sgl.assistant(sgl.gen("answer"))
+
+ runtime = sgl.Runtime(model_path="BabyChou/Yi-VL-34B",
+                       tokenizer_path="BabyChou/Yi-VL-34B")
+ sgl.set_default_backend(runtime)
+
+ # Single-image question answering
+ state = image_qa.run(
+     image_path="images/cat.jpeg",
+     question="What is this?",
+     max_new_tokens=64)
+ print(state["answer"], "\n")
+ ```
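For intuition, each `s += ...` call in the snippet above appends one chat turn to the prompt state that SGLang manages and sends to the model. A rough pure-Python illustration of that accumulation pattern; the `### Human:`/`### Assistant:` turn labels follow a LLaVA-style chat template, and this sketch is not SGLang's actual implementation:

```python
# Illustrative sketch of prompt-state accumulation (not SGLang's real internals)
class PromptState:
    def __init__(self) -> None:
        self.text = ""

    def __iadd__(self, segment: str) -> "PromptState":
        self.text += segment  # each turn is appended in order
        return self

def user(content: str) -> str:
    # "<image_placeholder>" stands in for the projected image tokens
    return f"### Human: {content}\n"

def assistant(content: str) -> str:
    return f"### Assistant: {content}\n"

s = PromptState()
s += user("<image_placeholder> What is this?")
s += assistant("A cat sitting on a windowsill.")  # illustrative answer text
print(s.text)
```

With the real backend, `sgl.gen("answer")` fills the assistant turn with the model's generation rather than a fixed string.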

  ## License

  All usage must adhere to the [Yi Series Models Community License Agreement 2.1](https://huggingface.co/01-ai/Yi-VL-34B/blob/main/LICENSE).

+ For free commercial use, you only need to send an email to get official commercial permission.