Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
Abstract
Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). As a practical solution for aligning language models at test time with additional computation and without degradation, our approach expands the capabilities that can be obtained from off-the-shelf language models without further training.
Community
Introducing QAlign, a test-time alignment method that improves language model performance using Markov chain Monte Carlo.
With no model retraining, QAlign outperforms DPO-tuned models even when allowed to match inference compute, and achieves a 57% increase in average accuracy compared to a single generation across a suite of benchmarks. 🧵
Most alignment methods (like PPO and DPO) share three limitations: 1️⃣ they compress many diverse preferences into a single, monolithic model; 2️⃣ they require finetuning the LM; 3️⃣ they are unusable if model weights are private (e.g., GPT-4).
QAlign flips the script: it aligns outputs locally, per prompt, at test time, with no retraining and no access to weights or logits.
Across math reasoning tasks (GSM8K, MATH500), knowledge recall (MMLU), and alignment benchmarks (TruthfulQA, IFEval), it consistently outperforms majority voting, best-of-n, and weighted majority voting. It also outperforms a DPO model trained on the same preference data as the RM, even when the DPO model is allowed to match inference compute at test time with majority voting.
We view alignment as Bayesian inference. PPO and DPO effectively perform a global, amortized variational approximation on the preference dataset. Instead, QAlign leverages test-time compute to closely approximate the posterior for each individual prompt via Markov chain Monte Carlo with LLMs (building on QUEST).
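To make the MCMC idea concrete, here is a minimal Python sketch of a QUEST-style Metropolis-Hastings loop over full completions, targeting the reward-tilted distribution pi(y | x) ∝ p_LM(y | x) · exp(r(x, y) / β). The callables generate_suffix and reward are hypothetical placeholders for an LM sampling call and a reward-model score (they are not the quest-decoding API), and the acceptance rule uses the standard simplification for suffix-regeneration proposals drawn from the LM itself; see the paper and repository for the exact algorithm.

```python
import math
import random


def qalign_sample(prompt, generate_suffix, reward, beta=1.0, steps=128, seed=0):
    """Sketch of a QUEST-style Metropolis-Hastings chain over completions.

    Target: pi(y | x) proportional to p_LM(y | x) * exp(reward(x, y) / beta).
    `generate_suffix(prompt, prefix)` samples tokens from the LM given the
    prompt and a partial completion; `reward(prompt, tokens)` is a scalar
    reward-model score. Both are hypothetical stand-ins.
    """
    rng = random.Random(seed)

    # Start the chain from an ordinary LM sample.
    current = generate_suffix(prompt, [])
    current_r = reward(prompt, current)
    chain = [current]

    for _ in range(steps):
        # Proposal: keep a uniformly chosen prefix of the current completion
        # and let the LM regenerate everything after it.
        i = rng.randrange(max(len(current), 1))
        proposal = current[:i] + generate_suffix(prompt, current[:i])
        proposal_r = reward(prompt, proposal)

        # Because the suffix is proposed from the LM itself, the LM likelihood
        # cancels in the Metropolis-Hastings ratio; only the reward difference
        # and a length correction remain, so no logit access is needed.
        log_accept = ((proposal_r - current_r) / beta
                      + math.log(max(len(current), 1))
                      - math.log(max(len(proposal), 1)))
        if math.log(rng.random() + 1e-300) < log_accept:
            current, current_r = proposal, proposal_r
        chain.append(current)

    # Later states approximate draws from the aligned distribution; an answer
    # can be read off, e.g., by majority vote over the chain.
    return chain
```

The actual implementation differs in details (proposal index distribution, batching of LM and RM calls), so treat this only as an illustration of the accept/reject loop.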
QAlign is easy to use:
pip install quest-decoding
Then align any base LM to your own private reward model!
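As a toy, end-to-end illustration of the sketch above (not the quest-decoding API), any sampling function and any reward model can be plugged in; the stand-ins below are purely illustrative:

```python
import random

_rng = random.Random(0)


def toy_generate_suffix(prompt, prefix):
    # Stand-in "LM": emit 1-5 random digit tokens, ignoring the prompt.
    return [_rng.randint(0, 9) for _ in range(_rng.randint(1, 5))]


def toy_reward(prompt, tokens):
    # Stand-in "RM": prefer completions whose digits sum close to 10.
    return -abs(sum(tokens) - 10)


# Reuses qalign_sample from the sketch earlier in the thread.
chain = qalign_sample("toy prompt", toy_generate_suffix, toy_reward,
                      beta=0.5, steps=200)
print(chain[-1])  # biased toward high reward under the toy RM
```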
Try QAlign yourself:
Website: https://questdecoding.com/alignment
Paper: https://arxiv.org/abs/2504.03790
Code & Examples: https://github.com/goncalorafaria/qalign
Joint work w/ @nlpnoah