yonghenglh6 committed · Commit dd01288 · verified · 1 Parent(s): b4e9f55

Upload folder using huggingface_hub

Files changed (4):
  1. .gitattributes +1 -0
  2. README.md +179 -3
  3. figures/logo.png +0 -0
  4. figures/wechat.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ figures/wechat.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,179 @@
- ---
- license: mit
- ---

---
license: mit
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language
- chat
library_name: transformers
language:
- en
- zh
---
# dots.vlm1

<p align="center">
    <img src="figures/logo.png" width="300"/>
</p>
<p align="center">
    &nbsp;&nbsp;🤗 <a href="https://huggingface.co/rednote-hilab/dots.vlm1">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;📄 <a href="https://github.com/rednote-hilab/dots.vlm1/blob/main/assets/blog.md">Blog</a>&nbsp;&nbsp;
    <br>
    🖥️ <a href="https://huggingface.co/spaces/rednote-hilab/dots-vlm1-demo">Demo</a>&nbsp;&nbsp; | &nbsp;&nbsp;💬 <a href="https://raw.githubusercontent.com/rednote-hilab/dots.vlm1/master/assets/wechat.png">WeChat (微信)</a>&nbsp;&nbsp; | &nbsp;&nbsp;📕 <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c">rednote</a>&nbsp;&nbsp;
</p>

Visit our Hugging Face page (links above) or try dots.vlm1 in our [live demo](https://huggingface.co/spaces/rednote-hilab/dots-vlm1-demo). Enjoy!

## Introduction

We are excited to introduce **dots.vlm1**, the first vision-language model in the dots model family. Built upon a 1.2 billion-parameter vision encoder and the DeepSeek V3 large language model (LLM), **dots.vlm1** demonstrates strong multimodal understanding and reasoning capabilities.

Through large-scale pretraining and carefully tuned post-training, **dots.vlm1 achieves near state-of-the-art performance in both visual perception and reasoning**, setting a new performance ceiling for open-source vision-language models, while still maintaining competitive capabilities in pure-text tasks.

## Model Summary

**This repo contains the instruction-tuned `dots.vlm1` model**, which has the following features:

- Type: A multimodal vision-language model with a 1.2B-parameter vision encoder and the DeepSeek V3 LLM
- Training Stages: Vision encoder pretraining, VLM pretraining, and supervised fine-tuning (SFT)
- Architecture: NaViT vision encoder + MLP adapter + DeepSeek V3 MoE language model
- Vision Encoder: 1.2B parameters, 42 transformer layers with RMSNorm, SwiGLU, and 2D RoPE
- Supported Languages: English, Chinese
- Context Length: 65,536 tokens
- License: MIT

**Model Highlights**:
- **NaViT Vision Encoder**: Trained entirely from scratch rather than fine-tuned from an existing vision backbone. It natively supports dynamic resolution and incorporates pure visual supervision in addition to traditional text supervision, thereby raising the upper bound of its perceptual capacity. Beyond image captioning datasets, a large amount of structured image data was introduced during pretraining to improve the model's perceptual capabilities, particularly for tasks such as OCR.
- **Multimodal Training Data**: In addition to conventional approaches, dots.vlm1 leverages a wide range of synthetic data strategies to cover diverse image types (e.g., tables, charts, documents, graphics) and descriptions (e.g., alt text, dense captions, grounding annotations). Furthermore, a strong multimodal model was used to rewrite web page data with interleaved text and images, significantly improving the quality of the training corpus.

## Example Usage

### Environment Setup

You have two options to set up the environment:

#### Option 1: Using Base Image + Manual Installation
```bash
# Use the base SGLang image
docker run -it --gpus all lmsysorg/sglang:v0.4.9.post1-cu126

# Clone and install our custom SGLang branch
# IMPORTANT: only our specific SGLang version supports dots.vlm1 models.
# We have submitted a PR to the main SGLang repository (currently under review):
# https://github.com/sgl-project/sglang/pull/8778
git clone --branch dots.vlm1.v1 https://github.com/rednote-hilab/sglang sglang
pip install -e sglang/python
```
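
A quick way to confirm that the patched branch (rather than a stock SGLang install) is the one on your Python path is to import it and print its version. This is only an illustrative sanity check, not part of the official setup:

```python
# Sanity check: the SGLang build installed above from the dots.vlm1.v1 branch should be importable
import sglang

print(sglang.__version__)  # assumption: a 0.4.9-series version matching the base image
```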

#### Option 2: Using Pre-built Image (Recommended)
```bash
# Use our pre-built image with dots.vlm1 support
docker run -it --gpus all rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126
```

### Multi-Node Deployment

Our model supports distributed deployment across multiple machines. Here's how to set up a 2-node cluster:

**Prerequisites:**
- Model: `rednote-hilab/dots.vlm1.inst`
- Node 1 IP: `10.0.0.1` (master node)
- Node 2 IP: `10.0.0.2` (worker node)

#### Node 1 (Master - rank 0):
```bash
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"

python3 -m sglang.launch_server \
    --model-path $HF_MODEL_PATH \
    --tp 16 \
    --dist-init-addr 10.0.0.1:23456 \
    --nnodes 2 \
    --node-rank 0 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 15553 \
    --context-length 65536 \
    --max-running-requests 64 \
    --disable-radix-cache \
    --mem-fraction-static 0.8 \
    --chunked-prefill-size -1 \
    --chat-template dots-vlm \
    --cuda-graph-max-bs 64 \
    --quantization fp8
```

#### Node 2 (Worker - rank 1):
```bash
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"

python3 -m sglang.launch_server \
    --model-path $HF_MODEL_PATH \
    --tp 16 \
    --dist-init-addr 10.0.0.1:23456 \
    --nnodes 2 \
    --node-rank 1 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 15553 \
    --context-length 65536 \
    --max-running-requests 64 \
    --disable-radix-cache \
    --mem-fraction-static 0.8 \
    --chunked-prefill-size -1 \
    --chat-template dots-vlm \
    --cuda-graph-max-bs 64 \
    --quantization fp8
```
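
Loading the FP8 weights and capturing CUDA graphs on two nodes can take a while, so it is convenient to wait until the server actually answers before sending requests. The snippet below is a minimal sketch, not part of the official instructions: it polls the OpenAI-compatible `/v1/models` endpoint on the master node, reusing the host and port from the launch commands above.

```python
import time

import requests  # third-party HTTP client: pip install requests

BASE_URL = "http://10.0.0.1:15553"  # master node address and port from the launch command above

# Poll the OpenAI-compatible /v1/models endpoint until the server responds.
for _ in range(120):
    try:
        resp = requests.get(f"{BASE_URL}/v1/models", timeout=5)
        if resp.status_code == 200:
            print("Server is ready:", resp.json())
            break
    except requests.exceptions.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(10)
else:
    raise RuntimeError("Server did not become ready within the polling window")
```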

### Configuration Parameters

Key parameters:
- `--tp 16`: Tensor parallelism degree of 16 in total, split across the nodes (8 GPUs per node in this 2-node setup)
- `--nnodes 2`: Total number of nodes in the cluster
- `--node-rank`: Node identifier (0 for the master, 1+ for workers)
- `--context-length 65536`: Maximum context length
- `--quantization fp8`: Use FP8 quantization for efficiency
- `--chat-template dots-vlm`: Use the custom chat template for dots.vlm models

### API Usage

Once the server is launched, you can access the model through its OpenAI-compatible API:

```bash
curl -X POST http://localhost:15553/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Please briefly describe this image"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          }
        ]
      }
    ],
    "temperature": 0.1,
    "top_p": 0.9,
    "max_tokens": 55000
  }'
```
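
Because the endpoint is OpenAI-compatible, the same request can also be sent from Python with the `openai` client package. This is a minimal sketch mirroring the curl call above; the dummy API key and the model name `"model"` simply follow that example.

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the local SGLang server launched above.
client = OpenAI(base_url="http://localhost:15553/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please briefly describe this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    },
                },
            ],
        }
    ],
    temperature=0.1,
    top_p=0.9,
    max_tokens=55000,
)
print(response.choices[0].message.content)
```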

## Performance

On major visual benchmarks, dots.vlm1 has achieved overall performance comparable to leading models such as **Gemini 2.5 Pro** and **Seed-VL1.5 thinking**. In particular, it demonstrates strong visual-text understanding and reasoning capabilities on benchmarks such as **MMMU**, **MathVision**, and **OCR Reasoning**.

For typical text-based reasoning tasks (e.g., **AIME**, **GPQA**, **LiveCodeBench**), **dots.vlm1** performs roughly on par with **DeepSeek-R1-0528**, showing competitive general capability in mathematics and coding.

Overall, **dots.vlm1** approaches state-of-the-art levels in multimodal visual understanding and achieves mainstream performance in text reasoning.

Detailed evaluation results are available in our [blog](https://github.com/rednote-hilab/dots.vlm1/blob/main/assets/blog.md).
figures/logo.png ADDED
figures/wechat.png ADDED

Git LFS Details

  • SHA256: d609b22af97306adeb57a76a72626c64addca42a259a128d7d1522384741d298
  • Pointer size: 131 Bytes
  • Size of remote file: 267 kB