arxiv:2504.20157

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Published on Apr 28
· Submitted by passing2961 on Apr 30

Abstract

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes more stable policy optimization and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.
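
To make the mechanism concrete, here is a minimal sketch of how an MPO-style outer loop could wrap standard PPO training. The helpers `generate_rollouts`, `score_with_rm`, `refine_prompt_with_mrm`, and `ppo_step` are hypothetical placeholders for the policy sampler, the LLM reward model, the meta-reward model, and a PPO trainer; this is an illustration of the idea, not the authors' implementation.

```python
from typing import Callable, List


def mpo_training_loop(
    generate_rollouts: Callable[[], List[str]],                            # sample responses from the current policy
    score_with_rm: Callable[[str, str], float],                            # RM scores a response under the current prompt
    refine_prompt_with_mrm: Callable[[str, List[str], List[float]], str],  # MRM rewrites the RM prompt
    ppo_step: Callable[[List[str], List[float]], None],                    # one PPO update from (responses, rewards)
    initial_rm_prompt: str,
    num_iterations: int = 1000,
    refine_every: int = 50,
) -> str:
    """PPO-style training in which a meta-reward model periodically rewrites the reward prompt."""
    rm_prompt = initial_rm_prompt
    for step in range(num_iterations):
        # Sample responses from the current policy and score them with the LLM
        # reward model under the *current* evaluation prompt.
        responses = generate_rollouts()
        rewards = [score_with_rm(rm_prompt, r) for r in responses]
        ppo_step(responses, rewards)

        # Periodically let the meta-reward model inspect the evolving training
        # context (recent responses and their scores) and refine the reward
        # model's prompt, e.g. to close loopholes the policy has begun to exploit.
        if (step + 1) % refine_every == 0:
            rm_prompt = refine_prompt_with_mrm(rm_prompt, responses, rewards)
    return rm_prompt
```

The design point is that the reward prompt is treated as mutable training state: the meta-reward model periodically sees recent responses and their scores and can tighten the evaluation criteria whenever the policy starts exploiting them.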

Community

Paper submitter

I think the current RLAIF training pipelines, such as those based on PPO or GRPO, are relatively naive in that they fail to account for the evolving training context within the reward modeling process. This paper introduces a simple yet effective meta-level reward mechanism that integrates into existing PPO frameworks, substantially improving performance while reducing reliance on prompt engineering and mitigating reward hacking.

I wonder about the setting where the reward model is already a very strong model, over 1T parameters in size. Is the meta-reward model still necessary there?


Even a trillion-parameter reward model (RM) can be exploited if its evaluation rubric remains fixed. In our experiments, a 72B Qwen-based RM consistently assigned perfect (5/5) scores to degenerate responses such as “Title: The Myth of Reddit’s Inherent Badness …”, a one-liner that is clearly misaligned with the task. RL algorithms like PPO are highly effective at uncovering and exploiting such loopholes, and without the RM adapting its evaluation criteria, training can converge to a flawed policy.

Of course, another key advantage of incorporating a meta-reward model (MRM) is that it automates rubric refinement. This means that you don't have to prompt-engineer the evaluation prompt for your 1T RM.
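
As a rough illustration of what that automated refinement step could look like (the template and helper below are assumptions, not the prompt used in the paper), the MRM can be queried with the current rubric plus recently flagged high-scoring responses and asked to return a tightened rubric:

```python
# Hypothetical rubric-refinement query for the meta-reward model; the template
# text and helper are illustrative assumptions, not taken from the paper.
META_PROMPT_TEMPLATE = """You are supervising an LLM reward model.

Current evaluation rubric:
{current_rubric}

Recent responses that received the maximum score:
{flagged_responses}

Some of these responses may be exploiting gaps in the rubric (e.g. degenerate
one-liners). Rewrite the rubric so that such responses no longer score highly,
while keeping the criteria faithful to the original task. Return only the
revised rubric."""


def build_refinement_query(current_rubric: str, flagged_responses: list[str]) -> str:
    """Fill the template; the result would be sent to the meta-reward model."""
    formatted = "\n".join(f"- {r}" for r in flagged_responses)
    return META_PROMPT_TEMPLATE.format(
        current_rubric=current_rubric,
        flagged_responses=formatted,
    )
```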

Hope this answers your question!
