Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning. arXiv:2402.17457, published Feb 27, 2024.
Curvature-Informed SGD via General Purpose Lie-Group Preconditioners. arXiv:2402.04553, published Feb 7, 2024.
Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling. arXiv:2405.14578, published May 23, 2024.
Fast Benchmarking of Accuracy vs. Training Time with Cyclic Learning Rates. arXiv:2206.00832, published Jun 2, 2022.
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective. arXiv:2410.23743, published Oct 31, 2024.
ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models. arXiv:2410.09637, published Oct 12, 2024.
nGPT: Normalized Transformer with Representation Learning on the Hypersphere. arXiv:2410.01131, published Oct 1, 2024 (see the normalization sketch after this list).
Cautious Optimizers: Improving Training with One Line of Code. arXiv:2411.16085, published Nov 25, 2024 (see the masking sketch after this list).
MARS: Unleashing the Power of Variance Reduction for Training Large Models. arXiv:2411.10438, published Nov 15, 2024.
Understanding Gradient Descent through the Training Jacobian. arXiv:2412.07003, published Dec 9, 2024.
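For the nGPT entry above, here is a minimal PyTorch sketch of the core constraint as the title describes it: after every optimizer step, weight vectors are re-projected onto the unit hypersphere. The helper name `project_to_hypersphere_` and the toy training step are illustrative assumptions, not the paper's reference code; the full method also normalizes hidden states and uses learned scaling factors, which are omitted here.

```python
import torch

@torch.no_grad()
def project_to_hypersphere_(module: torch.nn.Module, eps: float = 1e-8) -> None:
    """Rescale each weight row to unit L2 norm after an optimizer step,
    a simplified take on nGPT's unit-hypersphere parameter constraint."""
    for p in module.parameters():
        if p.dim() >= 2:  # skip biases and other 1-D parameters
            p.div_(p.norm(dim=-1, keepdim=True).clamp(min=eps))

# Hypothetical usage around a single training step:
model = torch.nn.Linear(16, 16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(4, 16)
model(x).pow(2).mean().backward()
opt.step()
opt.zero_grad()
project_to_hypersphere_(model)  # re-project weights after every step
```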
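For the Cautious Optimizers entry above, a minimal sketch of the advertised one-line change: mask out the components of an optimizer's proposed update whose sign disagrees with the current gradient. The `cautious_mask` helper and the hand-rolled momentum loop are illustrative assumptions rather than the paper's reference implementation; the rescaling by the surviving fraction follows the paper's recipe as we understand it, and the paper applies the same mask inside optimizers such as AdamW and Lion.

```python
import torch

def cautious_mask(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Zero out update components whose sign disagrees with the gradient,
    then rescale by the surviving fraction to preserve step magnitude."""
    mask = (update * grad > 0).to(update.dtype)
    return update * mask / mask.mean().clamp(min=1e-3)

# Toy quadratic objective optimized with hand-rolled SGD + momentum:
p = torch.randn(10, requires_grad=True)
buf = torch.zeros_like(p)  # momentum buffer
lr, beta = 0.1, 0.9

for _ in range(100):
    loss = (p ** 2).sum()
    loss.backward()
    with torch.no_grad():
        buf.mul_(beta).add_(p.grad)               # standard momentum update
        p.add_(-lr * cautious_mask(buf, p.grad))  # the cautious one-line change
        p.grad = None
```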