Inconsistency on AIME 24 benchmark
#15 by Jung - opened
Hi, there is an inconsistency in the AIME 2024 results for phi4-reasoning-plus.
In the paper, avg pass@1 is 81.3.
In this blog post, https://www.microsoft.com/en-us/research/articles/phi-reasoning-once-again-redefining-what-is-possible-with-small-and-efficient-ai/
Figure 3 reads as 89.4.
(Other numbers, e.g. OmniMath and AIME25, are consistent.)