Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping • arXiv:2402.14083 • Published Feb 21, 2024
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints • arXiv:2305.13245 • Published May 22, 2023
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints • arXiv:2212.05055 • Published Dec 9, 2022
LongT5: Efficient Text-To-Text Transformer for Long Sequences • arXiv:2112.07916 • Published Dec 15, 2021
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer • arXiv:1910.10683 • Published Oct 23, 2019
The Impact of Positional Encoding on Length Generalization in Transformers • arXiv:2305.19466 • Published May 31, 2023
ByT5: Towards a token-free future with pre-trained byte-to-byte models • arXiv:2105.13626 • Published May 28, 2021