Update README_EVALUATION.md
README_EVALUATION.md +44 -6
@@ -28,10 +28,11 @@ Math: see data/*
 For model generation on single seed, please use the following command:
 
 ```
-bash generate_livecodebench.sh ${model_path} ${seed} ${output_path}
-bash generate_aime.sh ${model_path} ${seed} aime24 ${output_path}
-bash generate_aime.sh ${model_path} ${seed} aime25 ${output_path}
+bash generate_livecodebench.sh ${model_path} ${seed} ${output_path} ${model_type}
+bash generate_aime.sh ${model_path} ${seed} aime24 ${output_path} ${model_type}
+bash generate_aime.sh ${model_path} ${seed} aime25 ${output_path} ${model_type}
 ```
+Please specify model_type as r1 for AceReason-Nemotron-1.0 models, and qwen for AceReason-Nemotron-1.1 models.
 
 Or you can use our configured seeds to reproduce our results on AIME 24/25 (avg@64) and LiveCodeBench v5/v6 (avg@8) as follows:
 
@@ -52,7 +53,7 @@ python evaluate_aime.py --modelfolder ${output_path} --test_data data/aime25.jso
 We also left our generations into cache.tar.gz as references.
 
 ```
-LiveCodeBench AceReason-Nemotron-7B (Avg@8)
+LiveCodeBench AceReason-Nemotron-1.0-7B (Avg@8)
 =================================================================
 Months Corrects Total Accuracy
 2023-05 180 272 66.17647058823529
@@ -84,7 +85,7 @@ Months Corrects Total Accuracy
 v5 1142 2232 51.16487455197132
 v6 621 1400 44.357142857142854
 
-LiveCodeBench AceReason-Nemotron-14B (Avg@8)
+LiveCodeBench AceReason-Nemotron-1.0-14B (Avg@8)
 =================================================================
 Months Corrects Total Accuracy
 2023-05 211 272 77.57352941176471
@@ -116,6 +117,38 @@ Months Corrects Total Accuracy
 v5 1350 2232 60.483870967741936
 v6 775 1400 55.357142857142854
 
+LiveCodeBench AceReason-Nemotron-1.1-7B (Avg@8)
+=================================================================
+Months Corrects Total Accuracy
+2023-05 205 272 75.36764705882354
+2023-06 255 312 81.73076923076923
+2023-07 356 432 82.4074074074074
+2023-08 208 288 72.22222222222223
+2023-09 287 352 81.5340909090909
+2023-10 278 352 78.97727272727273
+2023-11 234 280 83.57142857142857
+2023-12 263 320 82.1875
+2024-01 215 288 74.65277777777777
+2024-02 182 256 71.09375
+2024-03 270 360 75.0
+2024-04 254 296 85.8108108108108
+2024-05 221 288 76.73611111111111
+05/23-05/24 3228 4096 78.80859375
+2024-06 309 368 83.96739130434783
+2024-07 235 344 68.31395348837209
+2024-08 292 528 55.303030303030305
+2024-09 211 376 56.11702127659574
+2024-10 254 424 59.905660377358494
+2024-11 269 456 58.99122807017544
+2024-12 239 392 60.96938775510204
+2025-01 194 408 47.549019607843135
+06/24-01/25 2003 3296 60.77063106796116
+2025-02 203 408 49.754901960784316
+2025-03 306 544 56.25
+2025-04 41 96 42.708333333333336
+v5 1283 2232 57.482078853046595
+v6 726 1400 51.857142857142854
+
 AceReason-Nemotron-7B
 ====================================
 AIME2024 (Avg@64) 68.64583333333334
@@ -125,4 +158,9 @@ AceReason-Nemotron-14B
 ====================================
 AIME2024 (Avg@64) 78.43749999999997
 AIME2025 (Avg@64) 67.65625
-
+
+AceReason-Nemotron-1.1-7B
+====================================
+AIME2024 (Avg@64) 72.60416666666667
+AIME2025 (Avg@64) 64.84375
+```
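As a sanity check on the newly added AceReason-Nemotron-1.1-7B table, the Accuracy column appears to be `100 * Corrects / Total`, and the range rows (for example `05/23-05/24`) appear to pool the monthly counts rather than average the monthly percentages. The sketch below recomputes one range row under that assumption. The helper names `accuracy` and `aggregate` are hypothetical and are not taken from the repository's evaluation scripts.

```python
import math

# Monthly (corrects, total) pairs copied from the 2023-05 .. 2024-05 rows
# of the AceReason-Nemotron-1.1-7B LiveCodeBench table in this diff.
MONTHLY_0523_0524 = [
    (205, 272), (255, 312), (356, 432), (208, 288), (287, 352),
    (278, 352), (234, 280), (263, 320), (215, 288), (182, 256),
    (270, 360), (254, 296), (221, 288),
]


def accuracy(corrects: int, total: int) -> float:
    """Percentage of correct generations (assumed formula for the Accuracy column)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * corrects / total


def aggregate(rows):
    """Pool corrects and totals over several rows (hypothetical helper)."""
    corrects = sum(c for c, _ in rows)
    total = sum(t for _, t in rows)
    return corrects, total


if __name__ == "__main__":
    pooled_corrects, pooled_total = aggregate(MONTHLY_0523_0524)
    # Compare against the reported 05/23-05/24 row: 3228 4096 78.80859375
    print(pooled_corrects, pooled_total, accuracy(pooled_corrects, pooled_total))
    # The v5 and v6 rows follow the same formula within float tolerance.
    print(math.isclose(accuracy(1283, 2232), 57.482078853046595))
    print(math.isclose(accuracy(726, 1400), 51.857142857142854))
```

Pooling the counts weights each month by its number of generations, which is why the range row does not equal the plain mean of the monthly percentages.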