Add comprehensive model card for RetNet model
This PR adds a comprehensive model card for this RetNet model, which is part of the research presented in the paper "[A Systematic Analysis of Hybrid Linear Attention](https://huggingface.co/papers/2507.06457)".
The model card includes:
- Relevant metadata (`license`, `library_name`, `pipeline_tag`) to enhance discoverability on the Hugging Face Hub.
- Links to the associated paper, project page (collection), and GitHub repository for the Flash Linear Attention project.
- A clear description of the model and a usage example using the `transformers` library for text generation.
- The academic citation for the related works.
README.md
ADDED
@@ -0,0 +1,112 @@
---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# A Systematic Analysis of Hybrid Linear Attention: RetNet Model

This repository contains a RetNet model checkpoint, part of the research presented in the paper [**A Systematic Analysis of Hybrid Linear Attention**](https://huggingface.co/papers/2507.06457). This work systematically evaluates linear attention models, both standalone and in hybrid architectures, to address the quadratic complexity and memory issues of traditional Transformers on long sequences.

The work develops and open-sources 72 models (36 at 340M parameters and 36 at 1.3B parameters), covering six linear attention variants across five hybridization ratios. This enables a comprehensive analysis of standard language modeling and recall tasks, revealing which architectural choices are effective for achieving Transformer-level recall efficiently.

## About this Model

This specific model is a **RetNet** variant, an architecture investigated in the "A Systematic Analysis of Hybrid Linear Attention" research. It is a 1.3B-parameter model trained on 100B tokens, designed for efficient sequence modeling and text generation.
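
To double-check these architecture details without downloading the full weights, a minimal sketch (reusing the example repository id `fla-hub/retnet-1.3B-100B` from the usage section below; replace it with the actual id of this checkpoint) is to load only the configuration:

```python
from transformers import AutoConfig

# Example repository id from the usage section below; replace with the actual id of this checkpoint.
model_name = "fla-hub/retnet-1.3B-100B"

# Loads only the lightweight configuration file, not the model weights.
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
print(config)  # reports hidden size, number of layers, vocabulary size, etc.
```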

## Usage

This model is compatible with the Hugging Face `transformers` library. You can load and use it for text generation as follows:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the specific model ID of this repository, e.g., 'fla-hub/retnet-1.3B-100B'
model_name = "fla-hub/retnet-1.3B-100B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True  # Required for custom architectures like RetNet
).eval()

input_prompt = "Power goes with permanence. Impermanence is impotence. And rotation is castration."
input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to(model.device)

# Generate text
outputs = model.generate(input_ids, max_new_tokens=64)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print(f"Prompt:\n{input_prompt}\n")
print(f"Generated:\n{generated_text}")
```
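
As a variation, the following sketch (reusing `model`, `tokenizer`, and `input_ids` from the block above) streams tokens to the console as they are generated and switches from greedy decoding to sampling; the exact sampling settings are illustrative only:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    max_new_tokens=64,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # illustrative values; tune to taste
    top_p=0.9,
    streamer=streamer,
)
```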

## Paper

The model was presented in the paper:
[**A Systematic Analysis of Hybrid Linear Attention**](https://huggingface.co/papers/2507.06457)

## Project Page

Explore more models and research related to Flash Linear Attention in the Hugging Face collection:
[**Hybrid Linear Attention Research Collection**](https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e)

## Code

The official implementation and more details regarding the Flash Linear Attention (FLA) project can be found in its GitHub repository:
[**fla-org/flash-linear-attention**](https://github.com/fla-org/flash-linear-attention)

## Citation

If you find this work useful, please consider citing the FLA library and the related works it builds on:

```bib
@software{yang2024fla,
  title  = {FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism},
  author = {Yang, Songlin and Zhang, Yu},
  url    = {https://github.com/fla-org/flash-linear-attention},
  month  = jan,
  year   = {2024}
}

@inproceedings{yang2024gdn,
  title     = {Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author    = {Songlin Yang and Jan Kautz and Ali Hatamizadeh},
  booktitle = {Proceedings of ICLR},
  year      = {2025}
}

@inproceedings{yang2024deltanet,
  title     = {Parallelizing Linear Transformers with the Delta Rule over Sequence Length},
  author    = {Yang, Songlin and Wang, Bailin and Zhang, Yu and Shen, Yikang and Kim, Yoon},
  booktitle = {Proceedings of NeurIPS},
  year      = {2024}
}

@inproceedings{zhang2024gsa,
  title     = {Gated Slot Attention for Efficient Linear-Time Sequence Modeling},
  author    = {Zhang, Yu and Yang, Songlin and Zhu, Ruijie and Zhang, Yue and Cui, Leyang and Wang, Yiqiao and Wang, Bolun and Shi, Freda and Wang, Bailin and Bi, Wei and Zhou, Peng and Fu, Guohong},
  booktitle = {Proceedings of NeurIPS},
  year      = {2024}
}

@inproceedings{qin2024hgrn2,
  title     = {HGRN2: Gated Linear RNNs with State Expansion},
  author    = {Qin, Zhen and Yang, Songlin and Sun, Weixuan and Shen, Xuyang and Li, Dong and Sun, Weigao and Zhong, Yiran},
  booktitle = {Proceedings of COLM},
  year      = {2024}
}

@inproceedings{yang2024gla,
  title     = {Gated Linear Attention Transformers with Hardware-Efficient Training},
  author    = {Yang, Songlin and Wang, Bailin and Shen, Yikang and Panda, Rameswar and Kim, Yoon},
  booktitle = {Proceedings of ICML},
  year      = {2024}
}
```