AceReason Evaluation Toolkit
Our evaluation scripts and code are available at https://huggingface.co/nvidia/AceReason-Nemotron-14B/blob/main/evaluation.tar.gz
Environment
- vllm==0.7.3
- torch==2.5.1
- transformers==4.48.2
- 8x NVIDIA H100 80GB HBM3 (CUDA Version: 12.8)
Dataset Download
LiveCodeBench:
from datasets import load_dataset
ds = load_dataset(
    "livecodebench/code_generation_lite",
    version_tag="release_v6",
)["test"]
ds.to_json("data/livecodebench_problems.json", orient="records", lines=False)
Math: see data/*
Evaluation Script
To generate model outputs with a single seed, use the following commands:
bash generate_livecodebench.sh ${model_path} ${seed} ${output_path} ${model_type}
bash generate_aime.sh ${model_path} ${seed} aime24 ${output_path} ${model_type}
bash generate_aime.sh ${model_path} ${seed} aime25 ${output_path} ${model_type}
Set model_type to r1 for AceReason-Nemotron-1.0 models and to qwen for AceReason-Nemotron-1.1 models.
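When several seeds are run by hand, the single-seed commands above can be driven from a small script. The following is only a sketch: the seed values, the model path, and the output path are hypothetical placeholders, and it assumes one output directory can be shared across seeds, which should be checked against the scripts themselves.

```python
import shlex

# Hypothetical seeds for illustration only. The seeds used by run_aime.sh and
# run_livecodebench.sh are defined inside those scripts and are not listed here.
EXAMPLE_SEEDS = range(100, 104)


def build_generation_commands(model_path, output_path, model_type, seeds=EXAMPLE_SEEDS):
    """Return one argv list per (benchmark, seed), mirroring the single-seed commands."""
    commands = []
    for seed in seeds:
        commands.append(
            ["bash", "generate_livecodebench.sh", model_path, str(seed), output_path, model_type]
        )
        for split in ("aime24", "aime25"):
            commands.append(
                ["bash", "generate_aime.sh", model_path, str(seed), split, output_path, model_type]
            )
    return commands


if __name__ == "__main__":
    # Placeholder paths; replace with a real checkpoint and output directory.
    for argv in build_generation_commands("/path/to/model", "outputs/example", "qwen"):
        # Printing keeps this a dry run; use subprocess.run(argv, check=True) to execute.
        print(shlex.join(argv))
```

Each seed produces three commands (LiveCodeBench, AIME 24, AIME 25), so the four example seeds yield twelve commands in total.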
Alternatively, use our preconfigured seeds to reproduce our results on AIME 24/25 (avg@64) and LiveCodeBench v5/v6 (avg@8):
bash run_livecodebench.sh ${model_path} ${output_path}
bash run_aime.sh ${model_path} ${output_path}
To score the generations and reproduce our results, run the following evaluation commands:
python evaluate_livecodebench.py -g ${output_path}
python evaluate_aime.py --modelfolder ${output_path} --test_data data/aime24.jsonl
python evaluate_aime.py --modelfolder ${output_path} --test_data data/aime25.jsonl
Reference Results
We also provide our generations in cache.tar.gz for reference.
LiveCodeBench AceReason-Nemotron-1.0-7B (Avg@8)
=================================================================
Months Corrects Total Accuracy
2023-05 180 272 66.17647058823529
2023-06 238 312 76.28205128205128
2023-07 337 432 78.00925925925925
2023-08 185 288 64.23611111111111
2023-09 275 352 78.125
2023-10 257 352 73.01136363636364
2023-11 217 280 77.5
2023-12 228 320 71.25
2024-01 193 288 67.01388888888889
2024-02 169 256 66.015625
2024-03 234 360 65.0
2024-04 226 296 76.35135135135135
2024-05 211 288 73.26388888888889
05/23-05/24 2950 4096 72.021484375
2024-06 277 368 75.27173913043478
2024-07 223 344 64.82558139534883
2024-08 275 528 52.083333333333336
2024-09 204 376 54.255319148936174
2024-10 209 424 49.29245283018868
2024-11 216 456 47.36842105263158
2024-12 223 392 56.88775510204081
2025-01 161 408 39.46078431372549
06/24-01/25 1788 3296 54.24757281553398
2025-02 179 408 43.872549019607845
2025-03 258 544 47.4264705882353
2025-04 38 96 39.583333333333336
v5 1142 2232 51.16487455197132
v6 621 1400 44.357142857142854
LiveCodeBench AceReason-Nemotron-1.0-14B (Avg@8)
=================================================================
Months Corrects Total Accuracy
2023-05 211 272 77.57352941176471
2023-06 282 312 90.38461538461539
2023-07 393 432 90.97222222222223
2023-08 219 288 76.04166666666667
2023-09 315 352 89.48863636363636
2023-10 294 352 83.52272727272727
2023-11 229 280 81.78571428571429
2023-12 263 320 82.1875
2024-01 219 288 76.04166666666667
2024-02 201 256 78.515625
2024-03 296 360 82.22222222222223
2024-04 252 296 85.13513513513513
2024-05 233 288 80.90277777777777
05/23-05/24 3407 4096 83.1787109375
2024-06 311 368 84.51086956521739
2024-07 248 344 72.09302325581395
2024-08 299 528 56.628787878787875
2024-09 232 376 61.702127659574465
2024-10 266 424 62.735849056603776
2024-11 282 456 61.8421052631579
2024-12 253 392 64.54081632653062
2025-01 217 408 53.18627450980392
06/24-01/25 2108 3296 63.95631067961165
2025-02 211 408 51.71568627450981
2025-03 324 544 59.55882352941177
2025-04 41 96 42.708333333333336
v5 1350 2232 60.483870967741936
v6 775 1400 55.357142857142854
LiveCodeBench AceReason-Nemotron-1.1-7B (Avg@8)
=================================================================
Months Corrects Total Accuracy
2023-05 205 272 75.36764705882354
2023-06 255 312 81.73076923076923
2023-07 356 432 82.4074074074074
2023-08 208 288 72.22222222222223
2023-09 287 352 81.5340909090909
2023-10 278 352 78.97727272727273
2023-11 234 280 83.57142857142857
2023-12 263 320 82.1875
2024-01 215 288 74.65277777777777
2024-02 182 256 71.09375
2024-03 270 360 75.0
2024-04 254 296 85.8108108108108
2024-05 221 288 76.73611111111111
05/23-05/24 3228 4096 78.80859375
2024-06 309 368 83.96739130434783
2024-07 235 344 68.31395348837209
2024-08 292 528 55.303030303030305
2024-09 211 376 56.11702127659574
2024-10 254 424 59.905660377358494
2024-11 269 456 58.99122807017544
2024-12 239 392 60.96938775510204
2025-01 194 408 47.549019607843135
06/24-01/25 2003 3296 60.77063106796116
2025-02 203 408 49.754901960784316
2025-03 306 544 56.25
2025-04 41 96 42.708333333333336
v5 1283 2232 57.482078853046595
v6 726 1400 51.857142857142854
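The date-range rows in the LiveCodeBench tables can be reproduced from the monthly rows. The sketch below uses the 2023-05 to 2024-05 rows of the AceReason-Nemotron-1.1-7B table and pools the counts before dividing. It assumes that each Total counts generations rather than problems (every total is divisible by 8, consistent with Avg@8); the helper names are illustrative and not taken from evaluate_livecodebench.py.

```python
# (correct, total) pairs copied from the AceReason-Nemotron-1.1-7B table,
# months 2023-05 through 2024-05.
MONTHLY = {
    "2023-05": (205, 272),
    "2023-06": (255, 312),
    "2023-07": (356, 432),
    "2023-08": (208, 288),
    "2023-09": (287, 352),
    "2023-10": (278, 352),
    "2023-11": (234, 280),
    "2023-12": (263, 320),
    "2024-01": (215, 288),
    "2024-02": (182, 256),
    "2024-03": (270, 360),
    "2024-04": (254, 296),
    "2024-05": (221, 288),
}


def pooled_accuracy(rows):
    """Sum counts over all rows, then divide (a count-weighted average)."""
    correct = sum(c for c, _ in rows)
    total = sum(t for _, t in rows)
    return correct, total, 100.0 * correct / total


def mean_of_monthly_accuracies(rows):
    """Unweighted mean of per-month accuracies, shown only for contrast."""
    rows = list(rows)
    return sum(100.0 * c / t for c, t in rows) / len(rows)


correct, total, accuracy = pooled_accuracy(MONTHLY.values())
print(correct, total, accuracy)  # 3228 4096 78.80859375
print(mean_of_monthly_accuracies(MONTHLY.values()))
```

The pooled result matches the 05/23-05/24 row exactly, while the unweighted mean of monthly accuracies comes out slightly different, which suggests the aggregate rows weight each month by its number of generations.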
AceReason-Nemotron-7B
====================================
AIME2024 (Avg@64) 68.64583333333334
AIME2025 (Avg@64) 53.59375000000002
AceReason-Nemotron-14B
====================================
AIME2024 (Avg@64) 78.43749999999997
AIME2025 (Avg@64) 67.65625
AceReason-Nemotron-1.1-7B
====================================
AIME2024 (Avg@64) 72.60416666666667
AIME2025 (Avg@64) 64.84375
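To relate the Avg@64 percentages to raw counts, they can be converted back into numbers of correct generations. This sketch assumes each AIME file holds 30 problems (AIME I and II, 15 each) sampled 64 times, giving 1920 generations per benchmark. The function name is illustrative and not part of evaluate_aime.py.

```python
def correct_generations(avg_percent, problems=30, samples=64):
    """Convert an Avg@k percentage into a count of correct generations.

    Assumes avg_percent is the mean accuracy over `samples` runs of `problems`
    problems, so the implied count is avg_percent / 100 * problems * samples.
    """
    return round(avg_percent / 100.0 * problems * samples)


# Values copied from the AIME results listed above.
REPORTED = {
    ("AceReason-Nemotron-7B", "AIME2024"): 68.64583333333334,
    ("AceReason-Nemotron-7B", "AIME2025"): 53.59375000000002,
    ("AceReason-Nemotron-14B", "AIME2024"): 78.43749999999997,
    ("AceReason-Nemotron-14B", "AIME2025"): 67.65625,
    ("AceReason-Nemotron-1.1-7B", "AIME2024"): 72.60416666666667,
    ("AceReason-Nemotron-1.1-7B", "AIME2025"): 64.84375,
}

for (model, benchmark), percent in REPORTED.items():
    print(model, benchmark, correct_generations(percent), "of", 30 * 64)
```

Under these assumptions, the AceReason-Nemotron-7B scores correspond to 1318 and 1029 correct generations out of 1920 on AIME 2024 and AIME 2025 respectively.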