Update README.md
README.md
CHANGED

@@ -120,20 +120,94 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using [lighteval](https://github.com/neuralmagic/lighteval/tree/reasoning). [vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.
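
Before launching the harnesses below, the checkpoint can be smoke-tested with a few lines of offline vLLM inference. This is a minimal illustrative sketch, not part of the evaluation setup; the prompt and sampling values are placeholders:

```python
from vllm import LLM, SamplingParams

# Load the FP8-dynamic checkpoint; the quantization scheme is read from the model config.
llm = LLM(model="RedHatAI/Qwen3-1.7B-FP8-dynamic", max_model_len=8192)

# Illustrative sampling settings only.
params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=256)
outputs = llm.generate(["Give a one-sentence definition of quantization."], params)
print(outputs[0].outputs[0].text)
```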

<details>
<summary>Evaluation details</summary>

**lm-evaluation-harness**

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-1.7B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-1.7B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks mgsm \
  --apply_chat_template \
  --batch_size auto
```

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-1.7B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=16384,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```
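
Note that the OpenLLM v2 (`leaderboard`) run raises `max_model_len` to 16384, while the v1 (`openllm`) and `mgsm` runs use 8192.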

**lighteval**

lighteval_model_arguments.yaml
```yaml
model_parameters:
  model_name: RedHatAI/Qwen3-1.7B-FP8-dynamic
  dtype: auto
  gpu_memory_utilization: 0.9
  max_model_length: 40960
  generation_parameters:
    temperature: 0.6
    top_k: 20
    min_p: 0.0
    top_p: 0.95
    max_new_tokens: 32768
```
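
These generation parameters correspond to the sampling settings recommended for Qwen3 in thinking mode (temperature 0.6, top-p 0.95, top-k 20, min-p 0).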

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime24|0|0" \
  --use_chat_template=true
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime25|0|0" \
  --use_chat_template=true
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|math_500|0|0" \
  --use_chat_template=true
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|gpqa:diamond|0|0" \
  --use_chat_template=true
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "extended|lcb:codegeneration" \
  --use_chat_template=true
```

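The five lighteval runs can also be scripted. A small sketch (assuming `lighteval` is on PATH and the YAML above is in the working directory) that passes each task spec as a single argv entry, sidestepping shell quoting of the `|` characters:

```python
import subprocess

# Task specs exactly as in the commands above.
TASKS = [
    "lighteval|aime24|0|0",
    "lighteval|aime25|0|0",
    "lighteval|math_500|0|0",
    "lighteval|gpqa:diamond|0|0",
    "extended|lcb:codegeneration",
]

for task in TASKS:
    # Passing the spec as one argument keeps the shell from treating '|' as a pipe.
    subprocess.run(
        ["lighteval", "vllm",
         "--model_args", "lighteval_model_arguments.yaml",
         "--tasks", task,
         "--use_chat_template=true"],
        check=True,
    )
```
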
</details>

### Accuracy

@@ -223,4 +297,140 @@ The model was evaluated on the OpenLLM leaderboard tasks (version 1), using [lm-

<td><strong>98.6%</strong>
</td>
</tr>
<tr>
<td rowspan="7"><strong>OpenLLM v2</strong>
</td>
<td>MMLU-Pro (5-shot)
</td>
<td>23.45
</td>
<td>21.38
</td>
<td>91.1%
</td>
</tr>
<tr>
<td>IFEval (0-shot)
</td>
<td>71.08
</td>
<td>70.93
</td>
<td>99.8%
</td>
</tr>
<tr>
<td>BBH (3-shot)
</td>
<td>7.13
</td>
<td>5.41
</td>
<td>---
</td>
</tr>
<tr>
<td>Math-lvl-5 (4-shot)
</td>
<td>35.91
</td>
<td>34.71
</td>
<td>96.7%
</td>
</tr>
<tr>
<td>GPQA (0-shot)
</td>
<td>0.11
</td>
<td>0.00
</td>
<td>---
</td>
</tr>
<tr>
<td>MuSR (0-shot)
</td>
<td>7.97
</td>
<td>7.18
</td>
<td>---
</td>
</tr>
<tr>
<td><strong>Average</strong>
</td>
<td><strong>24.28</strong>
</td>
<td><strong>23.27</strong>
</td>
<td><strong>95.8%</strong>
</td>
</tr>
<tr>
<td><strong>Multilingual</strong>
</td>
<td>MGSM (0-shot)
</td>
<td>22.10
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td rowspan="5"><strong>Reasoning<br>(generation)</strong>
</td>
<td>AIME 2024
</td>
<td>43.96
</td>
<td>40.10
</td>
<td>91.2%
</td>
</tr>
<tr>
<td>AIME 2025
</td>
<td>32.29
</td>
<td>32.29
</td>
<td>100.0%
</td>
</tr>
<tr>
<td>GPQA diamond
</td>
<td>38.38
</td>
<td>38.89
</td>
<td>101.3%
</td>
</tr>
<tr>
<td>Math-lvl-5
</td>
<td>89.00
</td>
<td>88.80
</td>
<td>99.8%
</td>
</tr>
<tr>
<td>LiveCodeBench
</td>
<td>33.44
</td>
<td>
</td>
<td>
</td>
</tr>
</table>
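
The Recovery column appears to be the quantized model's score expressed as a percentage of the unquantized baseline. A quick sketch reproducing two rows from the table above (the helper name is ours):

```python
# Recovery = 100 * quantized score / baseline score, using values from the table above.
def recovery(baseline: float, quantized: float) -> float:
    return 100.0 * quantized / baseline

print(f"IFEval (0-shot): {recovery(71.08, 70.93):.1f}%")  # 99.8%, matching the table
print(f"AIME 2024:       {recovery(43.96, 40.10):.1f}%")  # 91.2%, matching the table
```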