arXiv:2505.03005

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Published on May 5, 2025 · Submitted by SmerkyG on May 7, 2025
#3 Paper of the day

Abstract

We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than \$2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper
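
To put that token budget in perspective, here is a quick back-of-envelope check of the abstract's figure. It assumes a teacher pre-training corpus of roughly 18 trillion tokens, as reported for Qwen2.5; that number comes from the Qwen2.5 release, not from this abstract:

```python
# Back-of-envelope check of the "< 0.005%" claim.
# Assumption: the Qwen2.5 teachers were pre-trained on roughly 18 trillion tokens
# (figure from the Qwen2.5 report, not stated in this abstract).
teacher_tokens = 18e12
distill_tokens = 700e6   # upper end of the 350-700M token conversion budget

ratio = distill_tokens / teacher_tokens
print(f"{ratio:.4%} of the teacher's pre-training tokens")  # ~0.0039%, under 0.005%
```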

Community

SmerkyG (paper author, paper submitter):

RADLADS rapidly converts softmax-attention-based Transformers into linear attention models while maintaining high model quality! We present two new RWKV-based architectures that facilitate this conversion, along with a detailed description of the distillation process and hyperparameters. We hope this will help other researchers rapidly test new attention architectures at scale by reducing the burden of pre-training.
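
To make the idea of such a conversion concrete, below is a minimal, illustrative PyTorch sketch of one distillation step in the spirit of this approach: a frozen softmax-attention teacher supervises a student whose attention blocks have been replaced by a linear-attention variant (RWKV-style in the paper) and which, as is typical for such conversions, is otherwise initialized from the teacher's weights. The loss terms, their weights, and the assumption of HuggingFace-style model outputs are illustrative choices, not the paper's exact multi-stage recipe; see the training code repository above for the real implementation.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, batch, optimizer,
                      kd_weight=1.0, align_weight=1.0):
    """One illustrative distillation step: KL divergence on output logits plus
    an MSE alignment term on per-layer hidden states. A generic sketch, not the
    exact (multi-stage) RADLADS loss."""
    teacher.eval()
    with torch.no_grad():
        t_out = teacher(**batch, output_hidden_states=True)  # frozen softmax-attention teacher

    s_out = student(**batch, output_hidden_states=True)      # linear-attention student

    # Knowledge distillation: match the teacher's output distribution.
    kd = F.kl_div(
        F.log_softmax(s_out.logits, dim=-1),
        F.softmax(t_out.logits, dim=-1),
        reduction="batchmean",
    )

    # Hidden-state alignment: keep each student layer close to the teacher's.
    align = sum(
        F.mse_loss(s_h, t_h)
        for s_h, t_h in zip(s_out.hidden_states, t_out.hidden_states)
    ) / len(s_out.hidden_states)

    loss = kd_weight * kd + align_weight * align
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Here `batch` is assumed to be a dict of `input_ids`/`attention_mask` tensors drawn from a small training corpus; per the abstract, the full conversion needs only 350-700M tokens in total.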


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 1