Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +179 -3
- figures/logo.png +0 -0
- figures/wechat.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+figures/wechat.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,179 @@
- ---
- license: mit
- ---
---
license: mit

pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language
- chat
library_name: transformers
language:
- en
- zh
---
# dots.vlm1

<p align="center">
  <img src="figures/logo.png" width="300"/>
</p>
<p align="center">
  🤗 <a href="https://huggingface.co/rednote-hilab/dots.vlm1">Hugging Face</a> &nbsp;|&nbsp; 📄 <a href="https://github.com/rednote-hilab/dots.vlm1/blob/main/assets/blog.md">Blog</a>
  <br>
  🖥️ <a href="https://huggingface.co/spaces/rednote-hilab/dots-vlm1-demo">Demo</a> &nbsp;|&nbsp; 💬 <a href="https://raw.githubusercontent.com/rednote-hilab/dots.vlm1/master/assets/wechat.png">WeChat (微信)</a> &nbsp;|&nbsp; 📕 <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c">rednote</a>
</p>

Visit our Hugging Face page (links above) or check out our [live demo](https://huggingface.co/spaces/rednote-hilab/dots-vlm1-demo) to try dots.vlm1. Enjoy!

## Introduction

We are excited to introduce **dots.vlm1**, the first vision-language model in the dots model family. Built upon a 1.2-billion-parameter vision encoder and the DeepSeek V3 large language model (LLM), **dots.vlm1** demonstrates strong multimodal understanding and reasoning capabilities.

Through large-scale pretraining and carefully tuned post-training, **dots.vlm1 achieves near state-of-the-art performance in both visual perception and reasoning**, setting a new performance ceiling for open-source vision-language models while maintaining competitive capabilities on pure-text tasks.

## Model Summary

**This repo contains the instruction-tuned `dots.vlm1` model**, which has the following features:

- Type: A multimodal vision-language model with a 1.2B vision encoder and a DeepSeek V3 LLM
- Training Stages: Vision encoder pretraining, VLM pretraining, and supervised fine-tuning (SFT)
- Architecture: NaViT vision encoder + MLP adapter + DeepSeek V3 MoE language model
- Vision Encoder: 1.2B parameters, 42 transformer layers with RMSNorm, SwiGLU, and 2D RoPE
- Supported Languages: English, Chinese
- Context Length: 65,536 tokens
- License: MIT

**Model Highlights**:
- **NaViT Vision Encoder**: Trained entirely from scratch rather than fine-tuned from an existing vision backbone. It natively supports dynamic resolution and incorporates pure visual supervision in addition to traditional text supervision, raising the upper bound of its perceptual capacity. Beyond image-captioning datasets, a large amount of structured image data was introduced during pretraining to improve perception, particularly on tasks such as OCR.
- **Multimodal Training Data**: In addition to conventional approaches, dots.vlm1 leverages a wide range of synthetic-data strategies to cover diverse image types (e.g., tables, charts, documents, graphics) and descriptions (e.g., alt text, dense captions, grounding annotations). Furthermore, a strong multimodal model was used to rewrite web-page data with interleaved text and images, significantly improving the quality of the training corpus.

## Example Usage

### Environment Setup

You have two options to set up the environment:

#### Option 1: Using Base Image + Manual Installation
```bash
# Use the base SGLang image
docker run -it --gpus all lmsysorg/sglang:v0.4.9.post1-cu126

# Clone and install our custom SGLang branch
# IMPORTANT: Only our specific SGLang version supports dots.vlm1 models
# We have submitted a PR to the main SGLang repository (currently under review):
# https://github.com/sgl-project/sglang/pull/8778
git clone --branch dots.vlm1.v1 https://github.com/rednote-hilab/sglang sglang
pip install -e sglang/python
```

#### Option 2: Using Pre-built Image (Recommended)
```bash
# Use our pre-built image with dots.vlm1 support
docker run -it --gpus all rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126
```
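
Whichever option you use, a quick sanity check before launching the server is to confirm that the custom SGLang build is importable inside the container. The snippet below is a minimal sketch (run it with `python3`); the defensive `getattr` is only there in case the version attribute is not exposed by the installed build.

```python
# Minimal sanity check: confirm the custom SGLang build is importable
# inside the container before launching the server.
import sglang

print("sglang version:", getattr(sglang, "__version__", "unknown"))
```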

### Multi-Node Deployment

Our model supports distributed deployment across multiple machines. Here's how to set up a 2-node cluster:

**Prerequisites:**
- Model: `rednote-hilab/dots.vlm1.inst`
- Node 1 IP: `10.0.0.1` (master node)
- Node 2 IP: `10.0.0.2` (worker node)

#### Node 1 (Master - rank 0):
```bash
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"

python3 -m sglang.launch_server \
  --model-path $HF_MODEL_PATH \
  --tp 16 \
  --dist-init-addr 10.0.0.1:23456 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 15553 \
  --context-length 65536 \
  --max-running-requests 64 \
  --disable-radix-cache \
  --mem-fraction-static 0.8 \
  --chunked-prefill-size -1 \
  --chat-template dots-vlm \
  --cuda-graph-max-bs 64 \
  --quantization fp8
```

#### Node 2 (Worker - rank 1):
```bash
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"

python3 -m sglang.launch_server \
  --model-path $HF_MODEL_PATH \
  --tp 16 \
  --dist-init-addr 10.0.0.1:23456 \
  --nnodes 2 \
  --node-rank 1 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 15553 \
  --context-length 65536 \
  --max-running-requests 64 \
  --disable-radix-cache \
  --mem-fraction-static 0.8 \
  --chunked-prefill-size -1 \
  --chat-template dots-vlm \
  --cuda-graph-max-bs 64 \
  --quantization fp8
```

### Configuration Parameters

Key parameters:
- `--tp 16`: Tensor-parallel size of 16, split across the two nodes (8 GPUs per node)
- `--nnodes 2`: Total number of nodes in the cluster
- `--node-rank`: Node identifier (0 for the master, 1+ for workers)
- `--context-length 65536`: Maximum context length
- `--quantization fp8`: Use FP8 quantization for efficiency
- `--chat-template dots-vlm`: Use the custom chat template for the dots.vlm model
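
Once both nodes are up, you can confirm the server is serving the model by querying the OpenAI-compatible model list. The snippet below is a minimal sketch using only the Python standard library; it assumes the host/port from the commands above and the standard `/v1/models` route of SGLang's OpenAI-compatible server.

```python
# Readiness probe: list the models served by the OpenAI-compatible endpoint.
# Assumes the server launched above is reachable at localhost:15553.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:15553/v1/models", timeout=10) as resp:
    print(json.dumps(json.load(resp), indent=2))
```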

### API Usage

Once the server is launched, you can access the model through the OpenAI-compatible API:

```bash
curl -X POST http://localhost:15553/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Please briefly describe this image"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          }
        ]
      }
    ],
    "temperature": 0.1,
    "top_p": 0.9,
    "max_tokens": 55000
  }'
```
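
The same request can also be sent from Python with any OpenAI-compatible client. The snippet below is a minimal sketch assuming the `openai` package (`pip install openai`) is installed; the placeholder API key is arbitrary and is sufficient when the server is launched without an `--api-key`.

```python
# Send the same image-description request via the OpenAI-compatible endpoint.
# Assumes the server from the deployment section above is running on localhost:15553.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:15553/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please briefly describe this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    },
                },
            ],
        }
    ],
    temperature=0.1,
    top_p=0.9,
    max_tokens=55000,
)
print(response.choices[0].message.content)
```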

## Performance

On major visual benchmarks, dots.vlm1 achieves overall performance comparable to leading models such as **Gemini 2.5 Pro** and **Seed-VL1.5 thinking**. In particular, it demonstrates strong visual-text understanding and reasoning on datasets such as **MMMU**, **MathVision**, and **OCR Reasoning**.

On typical text-based reasoning tasks (e.g., **AIME**, **GPQA**, **LiveCodeBench**), **dots.vlm1** performs roughly on par with **DeepSeek-R1-0528**, showing competitive general capability in mathematics and coding.

Overall, **dots.vlm1** approaches state-of-the-art levels in multimodal visual understanding and achieves mainstream performance in text reasoning.

Detailed evaluation results are available in our [blog](https://github.com/rednote-hilab/dots.vlm1/blob/main/assets/blog.md).
figures/logo.png ADDED
figures/wechat.png ADDED (Git LFS)