Regarding the rewards in the knockout tournament

#1
by droussis - opened

Hi,
First of all, congrats on your work!

I just wanted to ask how the rewards are calculated for the Knockout Tournament when evaluating your model for post-training with GRPO.

Computing rewards from ELO ratings seems fairly straightforward, since each response in a group ends up with its own score.
However, it's less obvious how to score a group of responses in a knockout tournament.

Do you assign rewards based on the round in which a response was eliminated? In other words, the highest reward for the winner and then progressively lower rewards the earlier a response was eliminated?

Thank you very much, and once again congrats on the extremely useful and detailed work.

Hi,
Thank you so much for your kind words and your interest in our work. We really appreciate it!

Regarding your question: when performing post-training with GRPO, we do not directly assign rewards based on the knockout tournament results (e.g., the round in which a response was eliminated). Instead, as mentioned in our paper, we compute the rewards using the ELO rating system.
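
To make the idea concrete, here is a minimal sketch of how per-response rewards could be derived from ELO ratings within one group. It is an illustration under stated assumptions, not the implementation from the paper: the `prefer` judge, the round-robin pairing, the K-factor of 32, the base rating of 1000, and the mean-centering step are all hypothetical choices.

```python
import itertools
import random
from typing import Callable, List, Sequence


def elo_rewards(
    responses: Sequence[str],
    prefer: Callable[[str, str], bool],  # hypothetical judge: True if first arg wins
    k: float = 32.0,       # assumed K-factor
    base: float = 1000.0,  # assumed initial rating
    seed: int = 0,
) -> List[float]:
    """Return one mean-centered ELO-based reward per response."""
    ratings = [base] * len(responses)

    # Assumption: every pair is compared once (round-robin), in random order.
    pairs = list(itertools.combinations(range(len(responses)), 2))
    random.Random(seed).shuffle(pairs)

    for i, j in pairs:
        # Standard ELO expected score of i against j.
        expected_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
        score_i = 1.0 if prefer(responses[i], responses[j]) else 0.0
        delta = k * (score_i - expected_i)
        ratings[i] += delta
        ratings[j] -= delta  # zero-sum update keeps the group mean fixed

    # Center on the group mean so the rewards sum to zero.
    mean = sum(ratings) / len(ratings)
    return [r - mean for r in ratings]


# Toy usage: the "judge" simply prefers the longer string.
group = ["a", "bbb", "cc"]
rewards = elo_rewards(group, prefer=lambda x, y: len(x) > len(y))
```

Because every response gets a rating, each member of the group receives a graded score, which is what a group-relative method such as GRPO needs.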

Please feel free to let us know if you'd like more details, and thanks again for the thoughtful question!

Reward-Reasoning changed discussion status to closed

Hi,
Thank you for your response!

So, if I understood correctly, you used the knockout tournament only in the context of Best-of-N inference (e.g., comparing with ELO rating in section 4.5.1).
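
For reference, this is roughly how I picture the knockout tournament for Best-of-N selection. It is only a sketch under my own assumptions: `prefer` is a hypothetical pairwise judge, candidates are paired at random, and an unpaired candidate gets a bye to the next round.

```python
import random
from typing import Callable, Sequence


def knockout_best_of_n(
    candidates: Sequence[str],
    prefer: Callable[[str, str], bool],  # hypothetical judge: True if first arg wins
    seed: int = 0,
) -> str:
    """Return the single candidate that survives all knockout rounds."""
    if not candidates:
        raise ValueError("need at least one candidate")

    pool = list(candidates)
    random.Random(seed).shuffle(pool)  # assumed: random initial bracket

    while len(pool) > 1:
        next_round = []
        # Compare neighbours pairwise; the winner advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer(a, b) else b)
        # With an odd pool size, the last candidate advances unopposed.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round

    return pool[0]


# Toy usage: the "judge" simply prefers the longer string.
best = knockout_best_of_n(
    ["a", "bbbb", "cc", "ddd", "e"],
    prefer=lambda x, y: len(x) > len(y),
)
```

This returns only a winner, using about N - 1 comparisons, which fits selection at inference time but does not by itself give a graded score for every response.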

Once again, congratulations on your work!
