Learning Explainable Dense Reward Shapes via Bayesian Optimization
Abstract
Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence. However, this leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn the parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian Optimization and policy training to handle noise in the token reward estimates. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines on downstream tasks and allows an optimal policy to be found faster during training. Furthermore, we show theoretically that explainability methods that are feature-additive attribution functions preserve the same optimal policy as the original reward.
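The abstract describes estimating per-token rewards from a sequence-level reward model with feature-additive attribution methods such as SHAP, with the shaping parameters tuned by an outer Bayesian-optimization loop. The sketch below illustrates only the attribution step, not the authors' implementation: the reward-model checkpoint, the `sequence_reward` wrapper, and the mixing weight `w` are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): per-token reward credit from a
# sequence-level reward model using SHAP's text explainer, blended with the terminal reward.
import shap
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # illustrative reward model
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name).eval()

def sequence_reward(texts):
    """Scalar reward for each full text; SHAP perturbs tokens and re-calls this."""
    batch = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return reward_model(**batch).logits[:, 0].numpy()

# Text masker + Partition explainer: attributions are feature-additive, so the
# per-token values (plus the base value) sum back to the sequence-level reward.
explainer = shap.Explainer(sequence_reward, shap.maskers.Text(tokenizer))

def dense_rewards(text, w=0.5):
    """Shaped per-token rewards: w * SHAP credit, with (1 - w) * terminal reward on the
    last token. The weight w stands in for the shaping parameters that an outer
    Bayesian-optimization loop would tune."""
    explanation = explainer([text])
    token_credit = explanation.values[0]   # one attribution per token
    shaped = w * token_credit
    shaped[-1] += (1.0 - w) * sequence_reward([text])[0]
    return explanation.data[0], shaped

tokens, rewards = dense_rewards("The assistant gave a clear and correct answer.")
for tok, r in zip(tokens, rewards):
    print(f"{tok!r}: {r:+.3f}")
```

Because SHAP attributions are feature-additive, the per-token credits (plus the explainer's base value) sum back to the sequence-level reward, which is the property behind the paper's claim that such shaping preserves the optimal policy.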
Community
The following similar papers were recommended by the Semantic Scholar API:
- AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation (2025)
- Energy-Based Reward Models for Robust Language Model Alignment (2025)
- Direct Advantage Regression: Aligning LLMs with Online AI Reward (2025)
- Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization (2025)
- Supervised Optimism Correction: Be Confident When LLMs Are Sure (2025)
- MT-RewardTree: A Comprehensive Framework for Advancing LLM-Based Machine Translation via Reward Modeling (2025)
- A Survey of Direct Preference Optimization (2025)