LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
Abstract
VRPO is a variance-reduced preference optimization framework for Masked Diffusion Models that significantly improves their alignment with human preferences and their performance on mathematics, code, and alignment benchmarks.
While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/.
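The abstract names two estimator-level ideas, optimal Monte Carlo budget allocation and antithetic sampling, without spelling them out. Below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of how they might look for ELBO-based likelihood estimates in a masked diffusion model: the sampling budget is spread over distinct mask ratios with one sampled mask each, and the same ratios and masks are reused for the policy and the reference model so that their ELBO difference has lower variance. All names (`masked_elbo`, `vrpo_elbo_pair`, `MASK_ID`, the model call signature) are assumptions for illustration.

```python
# Minimal sketch, assuming `model` and `ref_model` map a (1, L) tensor of token ids
# to (1, L, V) per-token logits, and that the forward process masks each token
# independently with probability t. This is NOT the authors' code.
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the [MASK] token


def masked_elbo(model, tokens, t_samples, masks):
    """Monte Carlo ELBO estimate of log-likelihood for one sequence.

    tokens:    (L,) long tensor of token ids
    t_samples: (n,) mask ratios in (0, 1]
    masks:     (n, L) boolean masks drawn once and shared across models
    """
    estimates = []
    for t, mask in zip(t_samples, masks):
        noisy = tokens.clone()
        noisy[mask] = MASK_ID                  # forward process: mask tokens with prob t
        logits = model(noisy.unsqueeze(0))[0]  # (L, V) per-token logits
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp[torch.arange(len(tokens)), tokens]
        # ELBO term: (1/t) * sum of log-probabilities of the masked tokens
        estimates.append(token_logp[mask].sum() / t)
    return torch.stack(estimates).mean()


def vrpo_elbo_pair(model, ref_model, tokens, n_samples=8):
    """Antithetic estimate of log pi_theta(y) - log pi_ref(y) via shared noise."""
    L = tokens.numel()
    # Budget allocation: n distinct mask ratios, one sampled mask per ratio,
    # rather than repeating a few ratios many times.
    t_samples = torch.rand(n_samples).clamp(min=1e-3)
    masks = torch.rand(n_samples, L) < t_samples.unsqueeze(1)
    elbo_theta = masked_elbo(model, tokens, t_samples, masks)
    with torch.no_grad():
        elbo_ref = masked_elbo(ref_model, tokens, t_samples, masks)  # same t, same masks
    return elbo_theta - elbo_ref
```

In a DPO-style preference objective, such a log-ratio estimate would be computed for both the preferred and the dispreferred response and fed into a logistic loss; the sketch only shows the per-response estimator under the stated assumptions.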
Community
By introducing a novel variance reduction technique, LLaDA 1.5 enhances the stability of diffusion language model alignment, resulting in more robust and efficient training.