Best practice for QwQ-32B evaluation

#55
by wangxingjun778 - opened

Best practice: https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html
EvalScope LLM Evaluation Framework: https://github.com/modelscope/evalscope

  1. Supports "overthinking" and "underthinking" evaluation
  2. Supports performance evaluation broken down by math difficulty level
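For readers curious what such a metric could look like, here is a minimal, self-contained sketch of one way an "overthinking" signal might be quantified. This is an illustrative toy metric, not EvalScope's actual implementation; the function name and inputs are assumptions.

```python
def overthinking_ratio(total_tokens: int, tokens_to_first_answer: int) -> float:
    """Illustrative toy metric (NOT EvalScope's implementation).

    Returns the fraction of the reasoning trace generated *after* the model
    first produces a correct answer. A high ratio suggests overthinking:
    the model keeps deliberating long after it has solved the problem.
    """
    if total_tokens <= 0:
        return 0.0
    wasted = max(total_tokens - tokens_to_first_answer, 0)
    return wasted / total_tokens


# Example: a 1000-token trace where the first correct answer appears at token 400.
print(overthinking_ratio(1000, 400))  # -> 0.6
```

The same idea inverted (answering before enough deliberation on hard problems) could serve as a crude underthinking signal; see the best-practice link above for the metrics EvalScope actually reports.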


Some conclusions follow:

[image: evaluation conclusions]

Nice. How do you enforce QwQ-32B not to overthink?

This likely needs to be addressed during the model training phase, for example by designing a reward function tailored to the difficulty of the problem and adding appropriate penalty terms. Some tricks can also be borrowed from this article: DAPO: An Open-Source LLM Reinforcement Learning System at Scale (https://arxiv.org/pdf/2503.14476).
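As a concrete illustration of the kind of penalty term mentioned above, the DAPO paper describes a "soft overlong punishment": responses below a length threshold incur no penalty, responses inside a buffer zone are penalized linearly, and responses past the hard limit receive the full penalty. A minimal sketch follows; the default lengths are illustrative placeholders, not the paper's exact training configuration.

```python
def soft_overlong_penalty(resp_len: int, max_len: int = 20480, cache_len: int = 4096) -> float:
    """Sketch of a DAPO-style soft overlong punishment added to the task reward.

    - No penalty while resp_len <= max_len - cache_len.
    - Penalty grows linearly from 0 to -1 inside the cache_len buffer zone.
    - Full penalty of -1 once resp_len exceeds max_len.
    """
    threshold = max_len - cache_len
    if resp_len <= threshold:
        return 0.0
    if resp_len <= max_len:
        return (threshold - resp_len) / cache_len
    return -1.0


# Short response: no penalty; response inside the buffer: partial penalty.
print(soft_overlong_penalty(1000))   # -> 0.0
print(soft_overlong_penalty(18432))  # -> -0.5
```

Adding this term to the correctness reward discourages unnecessarily long reasoning traces without hard-truncating responses that only slightly exceed the budget.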

