Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL Paper • 2603.19470 • Published 7 days ago • 2
Self-rewarding correction for mathematical reasoning Paper • 2502.19613 • Published Feb 26, 2025 • 82