Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
Abstract
Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes more stable policy optimization and greatly reduces the need for manual reward-prompt design, yielding performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.
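To make the mechanism concrete, below is a minimal sketch of the training loop the abstract describes. The function names (generate, rm_score, mrm_refine, ppo_update) and the refinement schedule are our own placeholders, not the authors' released code: the reward model scores each rollout with its current prompt, and every few PPO steps the meta-reward model rewrites that prompt based on recent transcripts and scores.

```python
# Hypothetical sketch of an MPO-style outer loop (names and schedule are assumptions).
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class MPOState:
    rm_prompt: str                                           # evolving evaluation rubric used by the RM
    history: List[Tuple[str, str, float]] = field(default_factory=list)  # (task, response, score)

def mpo_training_loop(
    tasks: List[str],
    generate: Callable[[str], str],              # policy rollout: task -> response
    rm_score: Callable[[str, str, str], float],  # (rm_prompt, task, response) -> scalar reward
    mrm_refine: Callable[[str, List[Tuple[str, str, float]]], str],  # (rm_prompt, recent history) -> new prompt
    ppo_update: Callable[[str, str, float], None],  # one PPO step on a scored rollout
    initial_rm_prompt: str,
    refine_every: int = 50,
) -> MPOState:
    state = MPOState(rm_prompt=initial_rm_prompt)
    for step, task in enumerate(tasks, start=1):
        response = generate(task)                            # sample from the current policy
        reward = rm_score(state.rm_prompt, task, response)   # RM judges with its current rubric
        state.history.append((task, response, reward))
        ppo_update(task, response, reward)                   # standard policy-gradient update
        # Meta step: the MRM inspects recent rollouts and scores and rewrites the
        # RM's evaluation prompt so the rubric tracks the evolving training
        # context instead of staying fixed (and exploitable).
        if step % refine_every == 0:
            state.rm_prompt = mrm_refine(state.rm_prompt, state.history[-refine_every:])
    return state
```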
Community
I think the current RLAIF training pipelines, such as those based on PPO or GRPO, are relatively naive in that they fail to account for the evolving training context within the reward modeling process. This paper introduces a simple yet effective meta-level reward mechanism that integrates into existing PPO frameworks, substantially improving performance while reducing reliance on prompt engineering and mitigating reward hacking.
I wonder about the setting where the reward model is already a very strong model, over 1T parameters in size. Is the meta-reward model still necessary in that case?
Even a trillion-parameter reward model (RM) can be exploited if its evaluation rubric remains fixed. In our experiments, a 72B Qwen-based RM consistently assigned perfect (5/5) scores to degenerate responses such as “Title: The Myth of Reddit’s Inherent Badness …”, a one-line response that is clearly misaligned with the task. RL algorithms like PPO are highly effective at uncovering and exploiting such loopholes, and without the RM adapting its evaluation criteria, training can converge to a flawed policy.
Of course, another key advantage of incorporating a meta-reward model (MRM) is that it automates rubric refinement. This means that you don't have to prompt-engineer the evaluation prompt for your 1T RM.
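As a rough illustration of what that automated rubric refinement could look like, the sketch below assembles a meta-prompt for the MRM from the current rubric and a few recent (response, score) pairs. The helper name build_meta_prompt and the prompt wording are our assumptions for illustration, not taken from the paper.

```python
# Hypothetical rubric-refinement prompt for the meta-reward model (wording assumed).
from typing import List, Tuple

def build_meta_prompt(rm_prompt: str, samples: List[Tuple[str, float]]) -> str:
    """Assemble the instruction sent to the MRM, asking it to tighten the rubric."""
    shown = "\n".join(f"- score {score:.1f}: {response[:200]}" for response, score in samples)
    return (
        "You supervise a reward model used during RL training.\n\n"
        f"Current evaluation rubric:\n{rm_prompt}\n\n"
        f"Recent policy responses and the scores this rubric produced:\n{shown}\n\n"
        "Some high-scoring responses may be degenerate (e.g., a bare one-line title "
        "with no substance). Rewrite the rubric so such exploits no longer receive "
        "high scores while genuinely good responses still do. Return only the revised rubric."
    )
```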
Hope this answers your question!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Reward Design for Reinforcement Learning Agents (2025)
- GVPO: Group Variance Policy Optimization for Large Language Model Post-Training (2025)
- What Makes a Reward Model a Good Teacher? An Optimization Perspective (2025)
- Boosting Universal LLM Reward Design through Heuristic Reward Observation Space Evolution (2025)
- VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences (2025)
- A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization (2025)
- Energy-Based Reward Models for Robust Language Model Alignment (2025)