arxiv:2504.20157

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Published on Apr 28
· Submitted by passing2961 on Apr 30

Abstract

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes more stable policy optimization and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.
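
To make the mechanism concrete, here is a minimal sketch of how an MPO-style outer loop could wrap standard PPO training. The helpers `generate_rollouts`, `score_with_rm`, `refine_prompt_with_mrm`, and `ppo_step` are hypothetical placeholders for the policy sampler, the LLM reward model, the meta-reward model, and a PPO trainer; this is an illustration of the idea, not the authors' implementation.

```python
from typing import Callable, List


def mpo_training_loop(
    generate_rollouts: Callable[[], List[str]],                            # sample responses from the current policy
    score_with_rm: Callable[[str, str], float],                            # RM scores a response under the current prompt
    refine_prompt_with_mrm: Callable[[str, List[str], List[float]], str],  # MRM rewrites the RM prompt
    ppo_step: Callable[[List[str], List[float]], None],                    # one PPO update from (responses, rewards)
    initial_rm_prompt: str,
    num_iterations: int = 1000,
    refine_every: int = 50,
) -> str:
    """PPO-style training in which a meta-reward model periodically rewrites the reward prompt."""
    rm_prompt = initial_rm_prompt
    for step in range(num_iterations):
        # Sample responses from the current policy and score them with the LLM
        # reward model under the *current* evaluation prompt.
        responses = generate_rollouts()
        rewards = [score_with_rm(rm_prompt, r) for r in responses]
        ppo_step(responses, rewards)

        # Periodically let the meta-reward model inspect the evolving training
        # context (recent responses and their scores) and refine the reward
        # model's prompt, e.g. to close loopholes the policy has begun to exploit.
        if (step + 1) % refine_every == 0:
            rm_prompt = refine_prompt_with_mrm(rm_prompt, responses, rewards)
    return rm_prompt
```

The design point is that the reward prompt is treated as mutable training state: the meta-reward model periodically sees recent responses and their scores and can tighten the evaluation criteria whenever the policy starts exploiting them.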

Community

Paper submitter

I think the current RLAIF training pipelines, such as those based on PPO or GRPO, are relatively naive in that they fail to account for the evolving training context within the reward modeling process. This paper introduces a simple yet effective meta-level reward mechanism that integrates into existing PPO frameworks, substantially improving performance while reducing reliance on prompt engineering and mitigating reward hacking.

I wonder about the setting where the reward model is already a very strong model, over 1T parameters in size. Is the meta-reward model still necessary there?


Even a trillion-parameter reward model (RM) can be exploited if its evaluation rubric remains fixed. In our experiments, a 72B Qwen-based RM consistently assigned perfect (5/5) scores to degenerate responses such as “Title: The Myth of Reddit’s Inherent Badness …”, a one-liner that is clearly misaligned with the task. RL algorithms like PPO are highly effective at uncovering and exploiting such loopholes, and without the RM adapting its evaluation criteria, training can converge to a flawed policy.

Of course, another key advantage of incorporating a meta-reward model (MRM) is that it automates rubric refinement. This means that you don't have to prompt-engineer the evaluation prompt for your 1T RM.
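
As a rough illustration of what that automated refinement step could look like (the template and helper below are assumptions, not the prompt used in the paper), the MRM can be queried with the current rubric plus recently flagged high-scoring responses and asked to return a tightened rubric:

```python
# Hypothetical rubric-refinement query for the meta-reward model; the template
# text and helper are illustrative assumptions, not taken from the paper.
META_PROMPT_TEMPLATE = """You are supervising an LLM reward model.

Current evaluation rubric:
{current_rubric}

Recent responses that received the maximum score:
{flagged_responses}

Some of these responses may be exploiting gaps in the rubric (e.g. degenerate
one-liners). Rewrite the rubric so that such responses no longer score highly,
while keeping the criteria faithful to the original task. Return only the
revised rubric."""


def build_refinement_query(current_rubric: str, flagged_responses: list[str]) -> str:
    """Fill the template; the result would be sent to the meta-reward model."""
    formatted = "\n".join(f"- {r}" for r in flagged_responses)
    return META_PROMPT_TEMPLATE.format(
        current_rubric=current_rubric,
        flagged_responses=formatted,
    )
```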

Hope this answers your question!
