Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
Abstract
Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). As a practical solution for aligning language models at test time with additional computation and without degradation, our approach expands the capabilities that can be obtained from off-the-shelf language models without further training.
Community
Introducing QAlign, a test-time alignment method that improves language model performance using Markov chain Monte Carlo.
With no model retraining, QAlign outperforms DPO-tuned models even when allowed to match inference compute, and achieves a 57% increase in average accuracy compared to a single generation across a suite of benchmarks. 🧵
Most alignment methods (like PPO and DPO) share three limitations: 1️⃣ they compress many diverse preferences into a single, monolithic model; 2️⃣ they require finetuning the LM; 3️⃣ they are unusable if model weights are private (e.g., GPT-4).
QAlign flips the script: it aligns outputs locally, per prompt, at test time, with no retraining and no access to weights or logits.
Across math reasoning tasks (GSM8K, MATH500), knowledge recall (MMLU), and alignment benchmarks (TruthfulQA, IFEval), it consistently outperforms majority voting, best-of-n, and weighted majority voting. It also outperforms a DPO model trained on the same preference data as the RM, even when the DPO model is allowed to match inference compute at test time with majority voting.
We view alignment as Bayesian inference. PPO and DPO effectively perform a global, amortized variational approximation on the preference dataset. Instead, QAlign leverages test-time compute to closely approximate the posterior for each individual prompt via Markov chain Monte Carlo with LLMs (building on QUEST).
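To make the MCMC idea concrete, here is a minimal Python sketch of a QUEST-style Metropolis-Hastings loop over full completions, targeting the reward-tilted distribution pi(y | x) ∝ p_LM(y | x) · exp(r(x, y) / β). The callables generate_suffix and reward are hypothetical placeholders for an LM sampling call and a reward-model score (they are not the quest-decoding API), and the acceptance rule uses the standard simplification for suffix-regeneration proposals drawn from the LM itself; see the paper and repository for the exact algorithm.

```python
import math
import random


def qalign_sample(prompt, generate_suffix, reward, beta=1.0, steps=128, seed=0):
    """Sketch of a QUEST-style Metropolis-Hastings chain over completions.

    Target: pi(y | x) proportional to p_LM(y | x) * exp(reward(x, y) / beta).
    `generate_suffix(prompt, prefix)` samples tokens from the LM given the
    prompt and a partial completion; `reward(prompt, tokens)` is a scalar
    reward-model score. Both are hypothetical stand-ins.
    """
    rng = random.Random(seed)

    # Start the chain from an ordinary LM sample.
    current = generate_suffix(prompt, [])
    current_r = reward(prompt, current)
    chain = [current]

    for _ in range(steps):
        # Proposal: keep a uniformly chosen prefix of the current completion
        # and let the LM regenerate everything after it.
        i = rng.randrange(max(len(current), 1))
        proposal = current[:i] + generate_suffix(prompt, current[:i])
        proposal_r = reward(prompt, proposal)

        # Because the suffix is proposed from the LM itself, the LM likelihood
        # cancels in the Metropolis-Hastings ratio; only the reward difference
        # and a length correction remain, so no logit access is needed.
        log_accept = ((proposal_r - current_r) / beta
                      + math.log(max(len(current), 1))
                      - math.log(max(len(proposal), 1)))
        if math.log(rng.random() + 1e-300) < log_accept:
            current, current_r = proposal, proposal_r
        chain.append(current)

    # Later states approximate draws from the aligned distribution; an answer
    # can be read off, e.g., by majority vote over the chain.
    return chain
```

The actual implementation differs in details (proposal index distribution, batching of LM and RM calls), so treat this only as an illustration of the accept/reject loop.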
QAlign is easy to use:
pip install quest-decoding
Then align any base LM to your own private reward model!
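As a toy, end-to-end illustration of the sketch above (not the quest-decoding API), any sampling function and any reward model can be plugged in; the stand-ins below are purely illustrative:

```python
import random

_rng = random.Random(0)


def toy_generate_suffix(prompt, prefix):
    # Stand-in "LM": emit 1-5 random digit tokens, ignoring the prompt.
    return [_rng.randint(0, 9) for _ in range(_rng.randint(1, 5))]


def toy_reward(prompt, tokens):
    # Stand-in "RM": prefer completions whose digits sum close to 10.
    return -abs(sum(tokens) - 10)


# Reuses qalign_sample from the sketch earlier in the thread.
chain = qalign_sample("toy prompt", toy_generate_suffix, toy_reward,
                      beta=0.5, steps=200)
print(chain[-1])  # biased toward high reward under the toy RM
```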
Try QAlign yourself:
Website: https://questdecoding.com/alignment
Paper: https://arxiv.org/abs/2504.03790
Code & Examples: https://github.com/goncalorafaria/qalign
Joint work w/ @nlpnoah