SELF: Language-Driven Self-Evolution for Large Language Model Paper • 2310.00533 • Published Oct 1, 2023 • 2
GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length Paper • 2310.00576 • Published Oct 1, 2023 • 2
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity Paper • 2305.13169 • Published May 22, 2023 • 3
Transformers Can Achieve Length Generalization But Not Robustly Paper • 2402.09371 • Published Feb 14, 2024 • 13
Triple-Encoders: Representations That Fire Together, Wire Together Paper • 2402.12332 • Published Feb 19, 2024 • 2
Chain-of-Verification Reduces Hallucination in Large Language Models Paper • 2309.11495 • Published Sep 20, 2023 • 38
Contrastive Decoding Improves Reasoning in Large Language Models Paper • 2309.09117 • Published Sep 17, 2023 • 37
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws Paper • 2404.05405 • Published Apr 8, 2024 • 9
Neural Tangent Kernel: Convergence and Generalization in Neural Networks Paper • 1806.07572 • Published Jun 20, 2018 • 1
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models Paper • 2411.12580 • Published Nov 19, 2024 • 2
Studying Large Language Model Generalization with Influence Functions Paper • 2308.03296 • Published Aug 7, 2023 • 12
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations Paper • 2308.11466 • Published Aug 22, 2023 • 1
ByT5: Towards a token-free future with pre-trained byte-to-byte models Paper • 2105.13626 • Published May 28, 2021 • 3
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation Paper • 2103.06874 • Published Mar 11, 2021 • 1
No More Adam: Learning Rate Scaling at Initialization is All You Need Paper • 2412.11768 • Published Dec 16, 2024 • 41
Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published Dec 13, 2024 • 75
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models Paper • 2410.20771 • Published Oct 28, 2024 • 3