BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Abstract
BLEUBERI uses BLEU as a reward function to optimize instruction following in language models, achieving output quality comparable to models aligned with full reward models.
Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we first show that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string-matching metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
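To make the core idea concrete, here is a minimal sketch of what "BLEU as the reward inside GRPO" could look like. This is not the authors' released implementation (see the linked repository for that): the helper names `bleu_reward` and `grpo_advantages` are hypothetical, and the use of the `sacrebleu` package is an assumption.

```python
# Minimal sketch (NOT the authors' code) of BLEU-as-reward for GRPO.
# Assumes the `sacrebleu` package; helper names are hypothetical.
from statistics import mean, stdev
from typing import List

import sacrebleu


def bleu_reward(candidate: str, references: List[str]) -> float:
    """Score one sampled completion against reference output(s) with sentence-level BLEU."""
    # effective_order avoids zero scores when short outputs lack higher n-gram orders.
    metric = sacrebleu.BLEU(effective_order=True)
    return metric.sentence_score(candidate, references).score


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: a group of completions sampled for one instruction.
references = ["The capital of France is Paris."]
group = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Lyon.",
]
rewards = [bleu_reward(c, references) for c in group]
print(rewards, grpo_advantages(rewards))
```

The group normalization is the standard GRPO trick: no learned value model is needed, since each completion is judged relative to the other samples for the same instruction, with BLEU against the reference output supplying the raw reward.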
Community
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- RM-R1: Reward Modeling as Reasoning (2025)
- Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data (2025)
- Reward Reasoning Model (2025)
- More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment (2025)
- R3: Robust Rubric-Agnostic Reward Models (2025)
- CHARM: Calibrating Reward Models With Chatbot Arena Scores (2025)
- Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment (2025)