Inconsistency on AIME 24 benchmark

#15
by Jung - opened

Hi, there is an inconsistency in the AIME 2024 results reported for phi4-reasoning-plus:

  1. In the paper, avg pass@1 is 81.3.

  2. In this blog post, https://www.microsoft.com/en-us/research/articles/phi-reasoning-once-again-redefining-what-is-possible-with-small-and-efficient-ai/
    the value in Figure 3 reads as 89.4.

(The other numbers, e.g. OmniMath and AIME25, are consistent between the two sources.)
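
For reference, here is a minimal sketch of how I understand avg pass@1: the fraction of problems solved in a single run, averaged over several independent runs. This is only my assumption about the metric, not the evaluation code behind the paper or the blog. The function name `avg_pass_at_1` and the toy data are made up for illustration.

```python
from statistics import mean


def avg_pass_at_1(runs):
    """Average pass@1 in percent.

    `runs` is a list of runs; each run is a list of booleans,
    one per problem, True if the single sampled answer was correct.
    (Hypothetical helper, not taken from any evaluation harness.)
    """
    if not runs:
        raise ValueError("need at least one run")
    n_problems = len(runs[0])
    if n_problems == 0 or any(len(r) != n_problems for r in runs):
        raise ValueError("every run must cover the same non-empty problem set")
    # Accuracy of each run, then the mean across runs.
    per_run_accuracy = [sum(r) / n_problems for r in runs]
    return 100.0 * mean(per_run_accuracy)


# Made-up toy data: 2 runs over 4 problems (not real AIME results).
toy_runs = [
    [True, True, False, True],
    [True, False, False, True],
]
score = avg_pass_at_1(toy_runs)
```

If the two sources used a different number of runs or different sampling settings, a computation like this could give different values for the same model, but I don't know whether that is the cause here.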

