Weak-to-Strong Extrapolation Expedites Alignment
Abstract
Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., trained to align with human preference) in hand: can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a medium-aligned model can be viewed as an interpolation between a less-aligned (weaker) model, e.g., the initial SFT model, and a better-aligned (stronger) one, and thus directly obtains this stronger model by extrapolating from the weights of the two weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to reach and even surpass the fully trained one, without any additional training. Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and exhibits decent scalability across model sizes from 7B to 70B. Our work demonstrates the efficacy of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction that deserves future exploration.
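To make the extrapolation step concrete: if the medium-aligned model is assumed to lie on the line segment (in weight space) between a weaker and a stronger model, then moving beyond it along that line recovers the stronger model. The notation below (θ, α, β) is ours, written to match the abstract's description rather than the paper's exact formulation:

$$
\theta_{\text{medium}} = (1-\beta)\,\theta_{\text{weak}} + \beta\,\theta_{\text{strong}},\ \beta \in (0,1)
\quad\Longrightarrow\quad
\theta_{\text{strong}} = \theta_{\text{medium}} + \alpha\left(\theta_{\text{medium}} - \theta_{\text{weak}}\right),\ \alpha = \tfrac{1-\beta}{\beta} > 0 .
$$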
Community
ExPO, short for model extrapolation, is an extremely simple, efficient, and scalable method for boosting the alignment of LLMs with human preference.
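Below is a minimal sketch of the weight-extrapolation step in PyTorch, assuming the weaker (e.g., SFT) and stronger (e.g., DPO/RLHF) checkpoints share the same architecture and tokenizer; the checkpoint paths and the value of `ALPHA` are illustrative placeholders, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM

ALPHA = 0.3  # extrapolation strength; a hyperparameter to tune on held-out data

# Load the two relatively weaker checkpoints (paths are placeholders).
weak = AutoModelForCausalLM.from_pretrained("path/to/sft-model")    # less-aligned
strong = AutoModelForCausalLM.from_pretrained("path/to/dpo-model")  # better-aligned

# theta_expo = theta_dpo + alpha * (theta_dpo - theta_sft), applied per parameter.
with torch.no_grad():
    for p_weak, p_strong in zip(weak.parameters(), strong.parameters()):
        p_strong.add_(ALPHA * (p_strong - p_weak))

strong.save_pretrained("path/to/expo-model")  # the extrapolated, hopefully stronger model
```

The extrapolated weights can then be evaluated like any other checkpoint. If `ALPHA` is too large, the weights may drift into a poorly performing region, so small values are a sensible starting point.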
Evaluation results on the AlpacaEval 2.0 benchmark, reported as raw win rate and length-controlled (LC) win rate (you can find the evaluation outputs on the official GitHub repo):
| Model | Win Rate (Original) | LC Win Rate (Original) | Win Rate (+ ExPO) | LC Win Rate (+ ExPO) |
|---|---|---|---|---|
| HuggingFaceH4/zephyr-7b-alpha | 6.7% | 10.0% | 10.6% | 13.6% |
| HuggingFaceH4/zephyr-7b-beta | 10.2% | 13.2% | 11.1% | 14.0% |
| berkeley-nest/Starling-LM-7B-alpha | 15.0% | 18.3% | 18.2% | 19.5% |
| Nexusflow/Starling-LM-7B-beta | 26.6% | 25.8% | 29.6% | 26.4% |
| snorkelai/Snorkel-Mistral-PairRM | 24.7% | 24.0% | 28.8% | 26.4% |
| RLHFlow/LLaMA3-iterative-DPO-final | 29.2% | 36.0% | 32.7% | 37.8% |
| internlm/internlm2-chat-1.8b | 3.8% | 4.0% | 5.2% | 4.3% |
| internlm/internlm2-chat-7b | 20.5% | 18.3% | 28.1% | 22.7% |
| internlm/internlm2-chat-20b | 36.1% | 24.9% | 46.2% | 27.2% |
| allenai/tulu-2-dpo-7b | 8.5% | 10.2% | 11.5% | 11.7% |
| allenai/tulu-2-dpo-13b | 11.2% | 15.5% | 15.6% | 17.6% |
| allenai/tulu-2-dpo-70b | 15.4% | 21.2% | 23.0% | 25.7% |
Evaluation results on the MT-Bench benchmark (you can find the evaluation outputs on the official GitHub repo):
| Model | Original | + ExPO |
|---|---|---|
| HuggingFaceH4/zephyr-7b-alpha | 6.85 | 6.87 |
| HuggingFaceH4/zephyr-7b-beta | 7.02 | 7.06 |
| berkeley-nest/Starling-LM-7B-alpha | 7.82 | 7.91 |
| Nexusflow/Starling-LM-7B-beta | 8.10 | 8.18 |
| snorkelai/Snorkel-Mistral-PairRM | 7.63 | 7.69 |
| RLHFlow/LLaMA3-iterative-DPO-final | 8.08 | 8.45 |
| internlm/internlm2-chat-1.8b | 5.17 | 5.26 |
| internlm/internlm2-chat-7b | 7.72 | 7.80 |
| internlm/internlm2-chat-20b | 8.13 | 8.26 |
| allenai/tulu-2-dpo-7b | 6.35 | 6.38 |
| allenai/tulu-2-dpo-13b | 7.00 | 7.26 |
| allenai/tulu-2-dpo-70b | 7.79 | 8.03 |