ychenNLP committed on
Commit
8212cdd
·
verified ·
1 Parent(s): 5d7ec87

Update README_EVALUATION.md

Files changed (1)
  1. README_EVALUATION.md +44 -6
README_EVALUATION.md CHANGED
@@ -28,10 +28,11 @@ Math: see data/*
28
  For model generation on single seed, please use the following command:
29
 
30
  ```
31
- bash generate_livecodebench.sh ${model_path} ${seed} ${output_path}
32
- bash generate_aime.sh ${model_path} ${seed} aime24 ${output_path}
33
- bash generate_aime.sh ${model_path} ${seed} aime25 ${output_path}
34
  ```
 
35
 
36
  Or you can use our configured seeds to reproduce our results on AIME 24/25 (avg@64) and LiveCodeBench v5/v6 (avg@8) as follows:
37
 
@@ -52,7 +53,7 @@ python evaluate_aime.py --modelfolder ${output_path} --test_data data/aime25.jso
52
  We also left our generations into cache.tar.gz as references.
53
 
54
  ```
55
- LiveCodeBench AceReason-Nemotron-7B (Avg@8)
56
  =================================================================
57
  Months Corrects Total Accuracy
58
  2023-05 180 272 66.17647058823529
@@ -84,7 +85,7 @@ Months Corrects Total Accuracy
84
  v5 1142 2232 51.16487455197132
85
  v6 621 1400 44.357142857142854
86
 
87
- LiveCodeBench AceReason-Nemotron-14B (Avg@8)
88
  =================================================================
89
  Months Corrects Total Accuracy
90
  2023-05 211 272 77.57352941176471
@@ -116,6 +117,38 @@ Months Corrects Total Accuracy
116
  v5 1350 2232 60.483870967741936
117
  v6 775 1400 55.357142857142854
118
 
119
  AceReason-Nemotron-7B
120
  ====================================
121
  AIME2024 (Avg@64) 68.64583333333334
@@ -125,4 +158,9 @@ AceReason-Nemotron-14B
125
  ====================================
126
  AIME2024 (Avg@64) 78.43749999999997
127
  AIME2025 (Avg@64) 67.65625
128
- ```
 
28
  For model generation on single seed, please use the following command:
29
 
30
  ```
31
+ bash generate_livecodebench.sh ${model_path} ${seed} ${output_path} ${model_type}
32
+ bash generate_aime.sh ${model_path} ${seed} aime24 ${output_path} ${model_type}
33
+ bash generate_aime.sh ${model_path} ${seed} aime25 ${output_path} ${model_type}
34
  ```
35
+ Please specify model_type as r1 for AceReason-Nemotron-1.0 models, and qwen for AceReason-Nemotron-1.1 models.
36
 
37
  Or you can use our configured seeds to reproduce our results on AIME 24/25 (avg@64) and LiveCodeBench v5/v6 (avg@8) as follows:
38
 
 
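The `model_type` rule added in this hunk (`r1` for AceReason-Nemotron-1.0 models, `qwen` for AceReason-Nemotron-1.1 models) can be sketched as a small wrapper that builds the three single-seed commands. This is only an illustration: `model_type_for` and `build_commands` are hypothetical helpers, the model path and seed are placeholders, and it assumes the 1.1 models can be recognized by `AceReason-Nemotron-1.1` appearing in the model path.

```python
import shlex


def model_type_for(model_path: str) -> str:
    """Pick the model_type argument from the model path (hypothetical helper).

    Assumption: 1.1 checkpoints carry "AceReason-Nemotron-1.1" in their path;
    everything else is treated as a 1.0 model.
    """
    return "qwen" if "AceReason-Nemotron-1.1" in model_path else "r1"


def build_commands(model_path: str, seed: int, output_path: str) -> list[list[str]]:
    """Return the three single-seed generation commands as argument lists."""
    model_type = model_type_for(model_path)
    return [
        ["bash", "generate_livecodebench.sh", model_path, str(seed), output_path, model_type],
        ["bash", "generate_aime.sh", model_path, str(seed), "aime24", output_path, model_type],
        ["bash", "generate_aime.sh", model_path, str(seed), "aime25", output_path, model_type],
    ]


if __name__ == "__main__":
    # Placeholder path, seed, and output directory; substitute your own.
    for cmd in build_commands("path/to/AceReason-Nemotron-1.1-7B", 100, "outputs"):
        print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` instead of being printed, which avoids shell quoting problems when paths contain spaces.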
53
  We also left our generations into cache.tar.gz as references.
54
 
55
  ```
56
+ LiveCodeBench AceReason-Nemotron-1.0-7B (Avg@8)
57
  =================================================================
58
  Months Corrects Total Accuracy
59
  2023-05 180 272 66.17647058823529
 
85
  v5 1142 2232 51.16487455197132
86
  v6 621 1400 44.357142857142854
87
 
88
+ LiveCodeBench AceReason-Nemotron-1.0-14B (Avg@8)
89
  =================================================================
90
  Months Corrects Total Accuracy
91
  2023-05 211 272 77.57352941176471
 
117
  v5 1350 2232 60.483870967741936
118
  v6 775 1400 55.357142857142854
119
 
120
+ LiveCodeBench AceReason-Nemotron-1.1-7B (Avg@8)
121
+ =================================================================
122
+ Months Corrects Total Accuracy
123
+ 2023-05 205 272 75.36764705882354
124
+ 2023-06 255 312 81.73076923076923
125
+ 2023-07 356 432 82.4074074074074
126
+ 2023-08 208 288 72.22222222222223
127
+ 2023-09 287 352 81.5340909090909
128
+ 2023-10 278 352 78.97727272727273
129
+ 2023-11 234 280 83.57142857142857
130
+ 2023-12 263 320 82.1875
131
+ 2024-01 215 288 74.65277777777777
132
+ 2024-02 182 256 71.09375
133
+ 2024-03 270 360 75.0
134
+ 2024-04 254 296 85.8108108108108
135
+ 2024-05 221 288 76.73611111111111
136
+ 05/23-05/24 3228 4096 78.80859375
137
+ 2024-06 309 368 83.96739130434783
138
+ 2024-07 235 344 68.31395348837209
139
+ 2024-08 292 528 55.303030303030305
140
+ 2024-09 211 376 56.11702127659574
141
+ 2024-10 254 424 59.905660377358494
142
+ 2024-11 269 456 58.99122807017544
143
+ 2024-12 239 392 60.96938775510204
144
+ 2025-01 194 408 47.549019607843135
145
+ 06/24-01/25 2003 3296 60.77063106796116
146
+ 2025-02 203 408 49.754901960784316
147
+ 2025-03 306 544 56.25
148
+ 2025-04 41 96 42.708333333333336
149
+ v5 1283 2232 57.482078853046595
150
+ v6 726 1400 51.857142857142854
151
+
152
  AceReason-Nemotron-7B
153
  ====================================
154
  AIME2024 (Avg@64) 68.64583333333334
 
158
  ====================================
159
  AIME2024 (Avg@64) 78.43749999999997
160
  AIME2025 (Avg@64) 67.65625
161
+
162
+ AceReason-Nemotron-1.1-7B
163
+ ====================================
164
+ AIME2024 (Avg@64) 72.60416666666667
165
+ AIME2025 (Avg@64) 64.84375
166
+ ```
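The Accuracy column in the tables above appears to be `100 * Corrects / Total`, and the `05/23-05/24` row appears to be the sum of the monthly rows above it. The following sketch recomputes both for the newly added AceReason-Nemotron-1.1-7B LiveCodeBench table. The numbers are copied from the diff; the `accuracy` helper is illustrative and is not taken from the repository's evaluation scripts.

```python
import math

# (month, corrects, total) rows copied from the AceReason-Nemotron-1.1-7B
# LiveCodeBench table, 2023-05 through 2024-05.
MONTHLY = [
    ("2023-05", 205, 272), ("2023-06", 255, 312), ("2023-07", 356, 432),
    ("2023-08", 208, 288), ("2023-09", 287, 352), ("2023-10", 278, 352),
    ("2023-11", 234, 280), ("2023-12", 263, 320), ("2024-01", 215, 288),
    ("2024-02", 182, 256), ("2024-03", 270, 360), ("2024-04", 254, 296),
    ("2024-05", 221, 288),
]


def accuracy(corrects: int, total: int) -> float:
    """Accuracy in percent (illustrative helper)."""
    return 100.0 * corrects / total


# Aggregate the monthly rows into the 05/23-05/24 summary row.
corrects = sum(c for _, c, _ in MONTHLY)
total = sum(t for _, _, t in MONTHLY)
print(corrects, total, accuracy(corrects, total))  # 3228 4096 78.80859375

# Spot-check the v5 and v6 summary rows against the reported values.
assert math.isclose(accuracy(1283, 2232), 57.482078853046595)
assert math.isclose(accuracy(726, 1400), 51.857142857142854)
```

If Total counts one entry per problem and sample (an assumption, given the Avg@8 label), dividing Total by 8 gives the number of distinct problems in each row.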