Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck Paper • 2404.07647 • Published Apr 11, 2024
On the Scaling Laws of Geographical Representation in Language Models Paper • 2402.19406 • Published Feb 29, 2024
MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling Paper • 2212.07284 • Published Dec 14, 2022
Headless Language Models: Learning without Predicting with Contrastive Weight Tying Paper • 2309.08351 • Published Sep 15, 2023