---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---

# DeepSeek-R1-Distill-Qwen-32B-NVFP4

## Model Overview
- **Model Architecture:** DeepSeek-R1-Distill-Qwen-32B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Release Date:** 7/30/25
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) to the FP4 data type, ready for inference with vLLM >= 0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights and activations of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
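
As a rough back-of-the-envelope check of the memory savings (a sketch; the ~32.8B parameter count is approximate, and the small per-block NVFP4 scale overhead is ignored):

```python
# Approximate weight-storage footprint before and after FP4 quantization.
NUM_PARAMS = 32.8e9  # approximate parameter count of the 32B model

bf16_gb = NUM_PARAMS * 2 / 1e9    # 2 bytes per parameter -> ~66 GB
fp4_gb = NUM_PARAMS * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter -> ~16 GB

# NVFP4 additionally stores a shared scale per small block of weights,
# which adds a modest overhead on top of the 4-bit values.
print(f"BF16: ~{bf16_gb:.0f} GB, FP4: ~{fp4_gb:.0f} GB "
      f"({100 * (1 - fp4_gb / bf16_gb):.0f}% smaller)")
```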

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

<details>
<summary>Model Usage Code</summary>

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string using the model's chat template.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across GPUs via tensor parallelism.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
</details>

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
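
For example, after starting a server with `vllm serve RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4 --tensor-parallel-size 2`, the endpoint can be queried with the standard OpenAI client (a minimal sketch; the port and API key shown are the vLLM defaults):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default;
# the API key is unused unless the server is started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0.6,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```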

## Creation

This model was created by applying [LLM Compressor with calibration samples from the neuralmagic/calibration dataset](https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/llama4_example.py), as presented in the code snippet below. The snippet is a representative NVFP4 recipe; the exact calibration settings (sample count, sequence length) are assumptions.

<details>
<summary>Model Creation Code</summary>

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data; the sample count and sequence length are assumptions.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
ds = load_dataset("neuralmagic/calibration", name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")

def preprocess(example):
    # Flatten each chat transcript into a single text string.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Quantize the weights and activations of all Linear layers to NVFP4,
# keeping the lm_head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model and tokenizer.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
</details>

## Evaluation

This model was evaluated on the OpenLLM v1 and OpenLLM v2 benchmark suites, a set of reasoning tasks, and the HumanEval and HumanEval_64 coding benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>DeepSeek-R1-Distill-Qwen-32B</th>
      <th>DeepSeek-R1-Distill-Qwen-32B-NVFP4</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="7"><b>OpenLLM V1</b></td>
      <td>ARC Challenge</td>
      <td>67.66</td>
      <td>64.25</td>
      <td>94.94%</td>
    </tr>
    <tr>
      <td>GSM8K</td>
      <td>83.02</td>
      <td>84.84</td>
      <td>102.19%</td>
    </tr>
    <tr>
      <td>Hellaswag</td>
      <td>83.79</td>
      <td>83.28</td>
      <td>99.39%</td>
    </tr>
    <tr>
      <td>MMLU</td>
      <td>81.25</td>
      <td>80.79</td>
      <td>99.43%</td>
    </tr>
    <tr>
      <td>TruthfulQA-mc2</td>
      <td>58.37</td>
      <td>57.50</td>
      <td>98.51%</td>
    </tr>
    <tr>
      <td>Winogrande</td>
      <td>75.77</td>
      <td>76.40</td>
      <td>100.83%</td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td><b>74.98</b></td>
      <td><b>74.51</b></td>
      <td><b>99.38%</b></td>
    </tr>
    <tr>
      <td rowspan="7"><b>OpenLLM V2</b></td>
      <td>MMLU-Pro</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>IFEval</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>BBH</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>Math-Hard</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>GPQA</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>MuSR</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td><b></b></td>
      <td><b></b></td>
      <td><b></b></td>
    </tr>
    <tr>
      <td rowspan="4"><b>Reasoning</b></td>
      <td>Math 500</td>
      <td>95.09</td>
      <td>95.60</td>
      <td>100.54%</td>
    </tr>
    <tr>
      <td>GPQA (diamond)</td>
      <td>64.05</td>
      <td>61.11</td>
      <td>95.41%</td>
    </tr>
    <tr>
      <td>AIME25</td>
      <td>69.75 (AIME24)</td>
      <td>53.33</td>
      <td>76.45%</td>
    </tr>
    <tr>
      <td>LCB: Code Generation</td>
      <td>–</td>
      <td>54.29</td>
      <td>–</td>
    </tr>
    <tr>
      <td rowspan="6"><b>Coding</b></td>
      <td>HumanEval Instruct pass@1</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@2</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@8</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@16</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@32</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@64</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
  </tbody>
</table>
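
Recovery is the quantized score expressed as a percentage of the baseline score. A quick sketch of the computation, using the GSM8K row above:

```python
def recovery(quantized: float, baseline: float) -> float:
    """Recovery (%) as reported in the table: quantized / baseline * 100."""
    return 100.0 * quantized / baseline

# GSM8K: 84.84 (NVFP4) vs. 83.02 (baseline) -> ~102.19%
print(f"{recovery(84.84, 83.02):.2f}%")
```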

### Reproduction

The results were obtained using the following commands:

<details>
<summary>Model Evaluation Commands</summary>

#### OpenLLM v1
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

#### OpenLLM v2
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=15000,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --batch_size auto
```

#### HumanEval and HumanEval_64
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_instruct \
  --batch_size auto

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_64_instruct \
  --batch_size auto
```
</details>