AIME Evaluation result

#3
by Shuaiqi - opened

It seems Qwen3 used a maximum output length of 38,912 tokens for its AIME’24 and AIME’25 evaluations,
but Baichuan-M2-32B used 64K tokens.
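
To put the gap in numbers, here is a quick sketch of the arithmetic. It assumes "64K" means 64 × 1024 = 65,536 tokens, which the numbers above do not confirm; the constant and function names are only illustrative.

```python
# Output-length budgets mentioned in this thread.
QWEN3_MAX_OUTPUT_TOKENS = 38_912
# Assumption: "64K" is read as 64 * 1024; it could also mean 64,000.
BAICHUAN_M2_MAX_OUTPUT_TOKENS = 64 * 1024


def budget_ratio(larger: int, smaller: int) -> float:
    """Return how many times larger one token budget is than the other."""
    return larger / smaller


ratio = budget_ratio(BAICHUAN_M2_MAX_OUTPUT_TOKENS, QWEN3_MAX_OUTPUT_TOKENS)

# 38,912 is an exact multiple of 1024 (38 * 1024), which is why the
# 1024-based reading of "64K" seems plausible here.
print(f"{QWEN3_MAX_OUTPUT_TOKENS} = {QWEN3_MAX_OUTPUT_TOKENS // 1024} x 1024")
print(f"Baichuan-M2-32B budget is about {ratio:.2f}x the Qwen3 budget")
```

Under that assumption the Baichuan-M2-32B budget is roughly 1.7 times the Qwen3 one, so the two sets of AIME scores may not be directly comparable.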

