arxiv:2503.07426

RePO: ReLU-based Preference Optimization

Published on Mar 10

· Submitted by

junkang0909 on Mar 11

Upvote

Authors:

Junkang Wu ,

Kexin Huang ,

Xiang Wang

Abstract

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with single hyperparameter beta, subsequent methods like SimPO reintroduce complexity through dual parameters (beta, gamma). We propose {ReLU-based Preference Optimization (RePO)}, a streamlined algorithm that eliminates beta via two advances: (1) retaining SimPO's reference-free margins but removing beta through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case (beta to infty), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.

View arXiv page View PDF Add to collection

Community

junkang0909

Paper author Paper submitter 1 day ago

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2503.07426 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2503.07426 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2503.07426 in a Space README.md to link it from this page.