Language models scale reliably with over-training and on downstream tasks Paper • 2403.08540 • Published Mar 13, 2024
Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond Paper • 2305.13064 • Published May 22, 2023