The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training Paper • 2501.18965 • Published Jan 31 • 7
SGD with Clipping is Secretly Estimating the Median Gradient Paper • 2402.12828 • Published Feb 20, 2024