Reinforcement Learning (RL) isn't stuck in the same old PPO loop: in the last two months alone, researchers have introduced a new wave of techniques that are reshaping how we train and fine-tune LLMs, VLMs, and agents.
Here are 9 fresh policy optimization techniques worth knowing:
1. GSPO: Group Sequence Policy Optimization → Group Sequence Policy Optimization (2507.18071) Shifts importance weighting, clipping, and optimization from the token level to the sequence level, capturing the full response and improving stability compared to GRPO. The GSPO-token variant still allows token-level fine-tuning. (A minimal sketch of the sequence-level objective follows the list.)
3. HBPO: Hierarchical Budget Policy Optimization → Hierarchical Budget Policy Optimization for Adaptive Reasoning (2507.15844) This one trains models to adapt reasoning depth to problem complexity. It divides training samples into subgroups with different token budgets and uses budget-aware rewards to align reasoning effort with task difficulty (see the reward sketch after the list).
5. RePO: Replay-Enhanced Policy Optimization → RePO: Replay-Enhanced Policy Optimization (2506.09340) Introduces a replay buffer into on-policy RL for LLMs, retrieving diverse off-policy samples for each prompt to broaden what the model learns from every prompt (a buffer sketch follows below).
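
A minimal sketch of a sequence-level clipped objective in the spirit of GSPO; the function name, tensor shapes, and clip_eps value are assumptions for illustration, not taken from the paper:

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, clip_eps=3e-4):
    # logp_new, logp_old: (batch, seq_len) per-token log-probs under the
    # current and old policies; mask: (batch, seq_len), 1 for response tokens;
    # advantages: (batch,) group-normalized sequence-level advantages.
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    # Sequence-level importance ratio: length-normalized sum of token
    # log-ratios (a geometric mean of token ratios), computed in log space.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    # PPO-style clipping, applied once per sequence rather than per token.
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```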
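An illustrative budget-aware reward in the spirit of HBPO; the budget tiers, penalty form, and scale below are assumptions, not the paper's exact formulation:

```python
def budget_aware_reward(correct: bool, num_tokens: int, budget: int,
                        penalty_scale: float = 0.001) -> float:
    # Correct answers earn the base reward; responses that overrun their
    # subgroup's token budget are penalized in proportion to the overflow,
    # nudging the model to match reasoning length to problem difficulty.
    base = 1.0 if correct else 0.0
    overflow = max(0, num_tokens - budget)
    return base - penalty_scale * overflow

# Hypothetical subgroup budgets: each training sample is assigned to one
# budget tier, so the policy sees both short- and long-budget incentives.
budgets = [512, 1024, 2048, 4096]
```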
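And a minimal per-prompt replay buffer in the spirit of RePO; the class name, capacity, and uniform-sampling retrieval are placeholders rather than the paper's actual strategy:

```python
import random
from collections import defaultdict

class PromptReplayBuffer:
    def __init__(self, max_per_prompt: int = 64):
        self.max_per_prompt = max_per_prompt
        self.store = defaultdict(list)  # prompt_id -> past rollouts

    def add(self, prompt_id: str, rollout: dict) -> None:
        entries = self.store[prompt_id]
        entries.append(rollout)
        if len(entries) > self.max_per_prompt:
            entries.pop(0)  # evict the oldest off-policy sample

    def sample(self, prompt_id: str, k: int) -> list:
        # Mix these replayed rollouts with fresh on-policy ones for the same
        # prompt before computing group-relative advantages.
        entries = self.store[prompt_id]
        return random.sample(entries, min(k, len(entries)))
```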