QRWKV in, Qwerky out
README.md
CHANGED
@@ -7,13 +7,13 @@ library_name: transformers
 
 
 
 
-- Try out the model on [](https://substack.recursal.ai/p/qwerky-72b-and-32b-training-large)
+- Try out the model on [](https://featherless.ai/models/featherless-ai/QRWKV-72B)
+- Model details from our blog post here! [](https://substack.recursal.ai/p/qwerky-72b-and-32b-training-large)
 - This model was presented in [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).
 
-Benchmarks is as follows for both
+Benchmarks are as follows for both the QRWKV-QwQ-32B and QRWKV-72B models:
 
-| Tasks | Metric |
+| Tasks | Metric | QRWKV-QwQ-32B | Qwen/QwQ-32B | QRWKV-72B | Qwen2.5-72B-Instruct |
 |:---:|:---:|:---:|:---:|:---:|:---:|
 | arc_challenge | acc_norm | **0.5640** | 0.5563 | **0.6382** | 0.6323 |
 | arc_easy | acc_norm | 0.7837 | **0.7866** | **0.8443** | 0.8329 |
@@ -33,7 +33,7 @@ Since this model is not on transformers at the moment you will have to enable remote code
 ```py
 # ...
 
-model = AutoModelForCausalLM.from_pretrained("featherless-ai/
+model = AutoModelForCausalLM.from_pretrained("featherless-ai/QRWKV-72B", trust_remote_code=True)
 
 # ...
 ```
@@ -43,7 +43,7 @@ Other than enabling remote code, you may run the model like a regular model with
 ```py
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model_name = "featherless-ai/
+model_name = "featherless-ai/QRWKV-72B"
 
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
@@ -79,7 +79,7 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 
 Linear models offer a promising approach to significantly reducing computational costs at scale, particularly for large context lengths, enabling a >1000x improvement in inference costs, o1-style inference-time thinking, and wider AI accessibility.
 
-As demonstrated with our
+As demonstrated with our QRWKV-72B-Preview and prior models such as QRWKV6-32B Instruct Preview, we have successfully converted Qwen 2.5 72B into an RWKV variant without requiring a pretrain of the base model or retraining it from scratch, enabling us to test and validate the more efficient RWKV linear attention on a much smaller budget. Since our preview, we have continued to refine our technique and have improved the model over the preview iteration.
 
 As with our previous models, the model's inherent knowledge and dataset training are inherited from its "parent" model. Consequently, unlike previous RWKV models trained on 100+ languages, the QRWKV model is limited to approximately 30 languages supported by the Qwen line of models.
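To make the updated loading snippet above self-contained, here is a slightly fuller sketch. It assumes only the standard `transformers` loading path shown in the README fragment; the `torch_dtype` and `device_map` arguments are illustrative additions rather than settings taken from the model card.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "featherless-ai/QRWKV-72B"

# The RWKV-based architecture is served as custom code from the model repo,
# so remote code must be trusted when loading the model. Passing the same flag
# to the tokenizer is harmless if the tokenizer is already natively supported.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # illustrative; choose a dtype that fits your hardware
    device_map="auto",           # illustrative; requires the accelerate package
    trust_remote_code=True,
)
```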
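The last hunk header quotes `response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]`, the tail of the card's usage example; most of that example falls outside the diff. A minimal generation sketch in the same style, with a placeholder prompt and generation settings of our own choosing, might look like this:

```py
# Continuing from the `model` and `tokenizer` loaded above.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Drop the prompt tokens so only the newly generated continuation is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```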
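The ">1000x" inference-cost figure in the final hunk is easiest to read as a statement about how attention cost scales with context length. The back-of-envelope below is our own illustration, not a number from the model card or the RADLADS paper: it compares attention-only FLOPs for quadratic softmax attention against a fixed-state linear-attention update, and the width, head dimension, and constant factors are all assumptions.

```py
# Crude attention-only FLOP estimates per layer for one sequence of n tokens.
# All constants are rough assumptions; MLP cost and kernel details are ignored.
def softmax_attention_flops(n: int, d_model: int) -> float:
    # QK^T scores plus attention-weighted values both scale as n^2 * d_model.
    return 4.0 * n * n * d_model

def linear_attention_flops(n: int, d_model: int, d_head: int = 128) -> float:
    # A fixed-size recurrent state is updated and read once per token.
    return 8.0 * n * d_model * d_head

d_model = 8192  # assumed width in the ballpark of a 72B model
for n in (32_000, 256_000, 1_000_000):
    ratio = softmax_attention_flops(n, d_model) / linear_attention_flops(n, d_model)
    print(f"context {n:>9,} tokens -> quadratic/linear cost ratio ~ {ratio:,.0f}x")
```

Under these assumptions the ratio grows linearly with context length (roughly `n / (2 * d_head)`), crossing 1000x in the hundreds of thousands of tokens; the exact crossover depends entirely on constants this sketch ignores.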