arxiv:2504.03790

Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

Published on Apr 4 · Submitted by graf on Apr 8
Authors:

Abstract

Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). As a practical solution for aligning language models at test time with additional computation and without degradation, our approach expands the capabilities that can be obtained from off-the-shelf language models without further training.

Community

Paper author and submitter

Introducing QAlign 🚀, a test-time alignment method that improves language model performance using Markov chain Monte Carlo.
With no model retraining, QAlign outperforms DPO-tuned models even when the DPO models are allowed to match inference compute, and achieves a 57% increase in average accuracy over a single generation across a suite of benchmarks. 🧵

Most alignment methods (like PPO and DPO) do three things: 1️⃣ Compress many diverse preferences into a single, monolithic model. 2️⃣ Require finetuning of the LM. 3️⃣ Are unusable if model weights are private (e.g., GPT-4).
QAlign flips the script: it performs test-time alignment, aligning outputs locally, per prompt, at test time, with no retraining and no access to weights or logits.

Across math reasoning tasks (📚 GSM8K, 🧠 MATH500), knowledge recall (📖 MMLU), and alignment benchmarks (⚖️ TruthfulQA, 📝 IFEval), QAlign consistently outperforms majority voting, best-of-n, and weighted majority voting. It also outperforms a DPO model trained on the same preference data as the RM, even when the DPO model is allowed to match inference compute at test time with majority voting.
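
For readers less familiar with these baselines, here is a minimal sketch (plain Python, with illustrative names not taken from the QAlign codebase) of how the three aggregation rules turn n sampled answers and their RM scores into a single prediction:

```python
from collections import Counter, defaultdict

def best_of_n(answers, rewards):
    """Return the answer whose sample received the highest RM score."""
    return max(zip(answers, rewards), key=lambda pair: pair[1])[0]

def majority_vote(answers):
    """Return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

def weighted_majority_vote(answers, rewards):
    """Return the answer with the largest total reward mass."""
    totals = defaultdict(float)
    for answer, reward in zip(answers, rewards):
        totals[answer] += reward
    return max(totals, key=totals.get)

# Example: four sampled answers to the same prompt with RM scores.
answers = ["42", "41", "42", "7"]
rewards = [0.9, 1.3, 0.8, 0.2]
print(best_of_n(answers, rewards))               # "41" (single highest score)
print(majority_vote(answers))                    # "42" (most common answer)
print(weighted_majority_vote(answers, rewards))  # "42" (0.9 + 0.8 beats 1.3)
```

The toy example shows why the rules can disagree: best-of-n trusts the single highest-scored sample, while weighted majority voting pools reward mass across samples that agree on the same answer.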

We view alignment as Bayesian inference. PPO and DPO effectively perform global, amortized variational approximations over the preference dataset. Instead, QAlign uses test-time compute to closely approximate the posterior for each individual prompt via Markov chain Monte Carlo with LLMs (building on QUEST).
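
To make the idea concrete, here is a minimal Metropolis-Hastings sketch of the kind of chain this describes. It is an illustration under simplifying assumptions, not the quest-decoding implementation: `sample_from_lm` and `reward` are placeholder callables for the base LM and the reward model, and the independence proposal (redrawing a whole response from the base LM) stands in for the cheaper, more local proposals used in practice. The chain targets the reward-tilted distribution p_LM(y|x) · exp(r(x, y) / β).

```python
import math
import random

# Placeholder interfaces (assumptions, not the quest-decoding API):
#   sample_from_lm(prompt) -> a full candidate response sampled from the base LM
#   reward(prompt, response) -> scalar score from the reward model
def qalign_style_chain(prompt, sample_from_lm, reward, steps=128, beta=1.0):
    """Metropolis-Hastings sketch targeting p_LM(y|x) * exp(r(x, y) / beta).

    With an independence proposal drawn from the base LM itself, the LM
    likelihood terms cancel in the acceptance ratio, leaving only the
    reward difference scaled by beta.
    """
    current = sample_from_lm(prompt)
    current_r = reward(prompt, current)
    samples = [current]
    for _ in range(steps):
        proposal = sample_from_lm(prompt)
        proposal_r = reward(prompt, proposal)
        # Accept better-scored proposals always; worse ones with prob exp(Δr / β).
        if proposal_r >= current_r:
            accept = True
        else:
            accept = random.random() < math.exp((proposal_r - current_r) / beta)
        if accept:
            current, current_r = proposal, proposal_r
        samples.append(current)
    return samples  # aggregate downstream, e.g. majority vote over final answers
```

Because a longer chain draws samples ever closer to this tilted target rather than maximizing the RM score outright, quality can keep improving as test-time compute grows instead of collapsing into reward over-optimization.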

QAlign is easy to use:

pip install quest-decoding

Then align any base LM to your own private reward model!

Try QAlign 🚀 yourself:

🔗 Website: https://questdecoding.com/alignment
📄 Paper: https://arxiv.org/abs/2504.03790
📂 Code & Examples: https://github.com/goncalorafaria/qalign

Joint work w/ @nlpnoah
