alexmarques committed
Commit e776dcc · verified · 1 Parent(s): 3d4d6c9

Update README.md

Files changed (1):
  1. README.md +212 -2

README.md CHANGED
@@ -120,20 +120,94 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

## Evaluation

- The model was evaluated on the OpenLLM leaderboard tasks (version 1), using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [vLLM](https://docs.vllm.ai/en/stable/).
+ The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using [lighteval](https://github.com/neuralmagic/lighteval/tree/reasoning).
+ [vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.

<details>
<summary>Evaluation details</summary>

+ **lm-evaluation-harness**
```
lm_eval \
--model vllm \
- --model_args pretrained="RedHatAI/Qwen3-1.7B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
+ --model_args pretrained="RedHatAI/Qwen3-1.7B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
--tasks openllm \
--apply_chat_template \
--fewshot_as_multiturn \
--batch_size auto
```
+
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="RedHatAI/Qwen3-1.7B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
+ --tasks mgsm \
+ --apply_chat_template \
+ --batch_size auto
+ ```
+
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="RedHatAI/Qwen3-1.7B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=16384,enable_chunked_prefill=True,tensor_parallel_size=1 \
+ --tasks leaderboard \
+ --apply_chat_template \
+ --fewshot_as_multiturn \
+ --batch_size auto
+ ```
+
+ **lighteval**
+
+ lighteval_model_arguments.yaml
+ ```yaml
+ model_parameters:
+   model_name: RedHatAI/Qwen3-1.7B-FP8-dynamic
+   dtype: auto
+   gpu_memory_utilization: 0.9
+   max_model_length: 40960
+   generation_parameters:
+     temperature: 0.6
+     top_k: 20
+     min_p: 0.0
+     top_p: 0.95
+     max_new_tokens: 32768
+ ```
+
+ ```
+ lighteval vllm \
+ --model_args lighteval_model_arguments.yaml \
+ --tasks "lighteval|aime24|0|0" \
+ --use_chat_template
+ ```
+
+ ```
+ lighteval vllm \
+ --model_args lighteval_model_arguments.yaml \
+ --tasks "lighteval|aime25|0|0" \
+ --use_chat_template
+ ```
+
+ ```
+ lighteval vllm \
+ --model_args lighteval_model_arguments.yaml \
+ --tasks "lighteval|math_500|0|0" \
+ --use_chat_template
+ ```
+
+ ```
+ lighteval vllm \
+ --model_args lighteval_model_arguments.yaml \
+ --tasks "lighteval|gpqa:diamond|0|0" \
+ --use_chat_template
+ ```
+
+ ```
+ lighteval vllm \
+ --model_args lighteval_model_arguments.yaml \
+ --tasks "extended|lcb:codegeneration" \
+ --use_chat_template
+ ```
+
</details>

### Accuracy
@@ -223,4 +297,140 @@ The model was evaluated on the OpenLLM leaderboard tasks (version 1), using [lm-
<td><strong>98.6%</strong>
</td>
</tr>
+ <tr>
+ <td rowspan="7" ><strong>OpenLLM v2</strong>
+ </td>
+ <td>MMLU-Pro (5-shot)
+ </td>
+ <td>23.45
+ </td>
+ <td>21.38
+ </td>
+ <td>91.1%
+ </td>
+ </tr>
+ <tr>
+ <td>IFEval (0-shot)
+ </td>
+ <td>71.08
+ </td>
+ <td>70.93
+ </td>
+ <td>99.8%
+ </td>
+ </tr>
+ <tr>
+ <td>BBH (3-shot)
+ </td>
+ <td>7.13
+ </td>
+ <td>5.41
+ </td>
+ <td>---
+ </td>
+ </tr>
+ <tr>
+ <td>Math-lvl-5 (4-shot)
+ </td>
+ <td>35.91
+ </td>
+ <td>34.71
+ </td>
+ <td>96.7%
+ </td>
+ </tr>
+ <tr>
+ <td>GPQA (0-shot)
+ </td>
+ <td>0.11
+ </td>
+ <td>0.00
+ </td>
+ <td>---
+ </td>
+ </tr>
+ <tr>
+ <td>MuSR (0-shot)
+ </td>
+ <td>7.97
+ </td>
+ <td>7.18
+ </td>
+ <td>---
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Average</strong>
+ </td>
+ <td><strong>24.28</strong>
+ </td>
+ <td><strong>23.27</strong>
+ </td>
+ <td><strong>95.8%</strong>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Multilingual</strong>
+ </td>
+ <td>MGSM (0-shot)
+ </td>
+ <td>22.10
+ </td>
+ <td>
+ </td>
+ <td>
+ </td>
+ </tr>
+ <tr>
+ <td rowspan="6" ><strong>Reasoning<br>(generation)</strong>
+ </td>
+ <td>AIME 2024
+ </td>
+ <td>43.96
+ </td>
+ <td>40.10
+ </td>
+ <td>91.2%
+ </td>
+ </tr>
+ <tr>
+ <td>AIME 2025
+ </td>
+ <td>32.29
+ </td>
+ <td>32.29
+ </td>
+ <td>100.0%
+ </td>
+ </tr>
+ <tr>
+ <td>GPQA diamond
+ </td>
+ <td>38.38
+ </td>
+ <td>38.89
+ </td>
+ <td>101.3%
+ </td>
+ </tr>
+ <tr>
+ <td>Math-lvl-5
+ </td>
+ <td>89.00
+ </td>
+ <td>88.80
+ </td>
+ <td>99.8%
+ </td>
+ </tr>
+ <tr>
+ <td>LiveCodeBench
+ </td>
+ <td>33.44
+ </td>
+ <td>
+ </td>
+ <td>
+ </td>
+ </tr>
</table>
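
For readers reproducing the table, the "Recovery" column is presumably the quantized score expressed as a percentage of the unquantized baseline, i.e. the second score column divided by the first. A minimal sketch of that arithmetic in Python, using two rows copied from the OpenLLM v2 block above; the function name and script are illustrative, not part of the model card:

```python
# Sketch: recomputing the "Recovery" column, assuming it is
# 100 * quantized_score / baseline_score. Values copied from the table above.
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline

rows = [
    ("MMLU-Pro (5-shot)", 23.45, 21.38),  # table shows 91.1%
    ("IFEval (0-shot)", 71.08, 70.93),    # table shows 99.8%
]
for task, baseline, quantized in rows:
    print(f"{task}: {recovery(baseline, quantized):.1f}%")
# Prints 91.2% for MMLU-Pro vs. 91.1% in the table, suggesting the card
# rounds from unrounded scores rather than from the displayed values.
```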
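
The hunk context at the top of this diff also notes that vLLM supports OpenAI-compatible serving. As a usage sketch only (the base URL, port, API key, and prompt are vLLM-server defaults and invented examples, not taken from this commit), one could serve the checkpoint and query it with the sampling parameters from the lighteval config:

```python
# Assumes the model is already running behind vLLM's OpenAI-compatible
# server, started with e.g.:
#   vllm serve RedHatAI/Qwen3-1.7B-FP8-dynamic
from openai import OpenAI

# localhost:8000/v1 and "EMPTY" are vLLM's defaults, not from this commit.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen3-1.7B-FP8-dynamic",
    messages=[{"role": "user", "content": "Briefly explain FP8 dynamic quantization."}],
    temperature=0.6,  # sampling settings mirror the lighteval YAML above
    top_p=0.95,
    max_tokens=512,
)
print(response.choices[0].message.content)
```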