On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral • Paper • 2512.04220 • Published 28 days ago
Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning • Paper • 2510.03669 • Published Oct 4, 2025