Update README.md
feel free to propose further changes to this draft
README.md
# ModernGBERT 1B

ModernGBERT 1B is a German ModernBERT language model with 1 billion parameters and a native context length of up to 8,192 tokens. This model follows the same BERT-style architecture and training procedure as the ModernBERT [codebase](https://github.com/AnswerDotAI/ModernBERT).

ModernGBERT 1B has been pre-trained on the same 1.27 trillion tokens from the German portion of [RedPajama V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) as our [LLäMmlein](https://huggingface.co/collections/LSX-UniWue/llammlein-6732ff41f3705c686e605762) decoder family.

We provide two model sizes:

* [ModernGBERT 1B](https://huggingface.co/LSX-UniWue/ModernGBERT_1B) ← You are here
  28 layers, hidden size 2,048, 1 billion parameters
* [ModernGBERT 134M](https://huggingface.co/LSX-UniWue/ModernGBERT_134M)
  22 layers, hidden size 768, 134 million parameters

Find more details in our [preprint](https://arxiv.org/abs/2505.13136)!
### Usage
You can use ModernGBERT with the `transformers` library from version 4.48.0 onwards. Optionally, install `flash-attn` to achieve the highest efficiency.

Since ModernGBERT 1B is a Masked Language Model (MLM), you can load it via `AutoModelForMaskedLM`. For downstream tasks such as classification, retrieval, or QA, fine-tune the model following standard BERT fine-tuning recipes; a minimal sketch is shown after the example below.

Example using `AutoModelForMaskedLM`:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Die Hauptstadt von Frankreich ist [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: Paris
```
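As a starting point for the downstream fine-tuning mentioned above, here is a minimal sketch using `AutoModelForSequenceClassification` and the `Trainer` API. The dataset name, column names, label count, and hyperparameters are placeholders, not the setup used in the preprint:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels is task-specific; 2 is a placeholder for a binary task.
# The classification head is newly initialised and trained during fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Placeholder dataset: replace with your own data providing "text" and "label" columns.
dataset = load_dataset("your/german-classification-dataset")

def tokenize(batch):
    # The native context length is 8,192 tokens; 512 keeps this sketch lightweight.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="moderngbert-1b-classifier", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    processing_class=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```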
**NOTE:** If you want to use HuggingFace's PEFT library for LoRA training, you need to specify the target modules, e.g.:

```python
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    task_type="TOKEN_CLS", r=8, lora_alpha=32,
    target_modules=["Wqkv", "Wi", "Wo"],
)
model = get_peft_model(model, peft_config)
```
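The target module names above refer to the linear projections in ModernBERT's attention and MLP blocks. As a quick sanity check (assuming `model` is the PEFT-wrapped model from the snippet above), you can confirm that only the LoRA adapters are trainable:

```python
# Prints trainable vs. total parameter counts of the PEFT-wrapped model.
model.print_trainable_parameters()
```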
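You can also use ModernGBERT as a text encoder, e.g. for retrieval; this is the setting behind the German MTEB results below, where every model is additionally fine-tuned in an unsupervised fashion on German mMARCO. The following is only a minimal sketch that mean-pools the last hidden state of the plain pre-trained model, not the embedding setup from the preprint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = [
    "Würzburg liegt in Bayern.",
    "Die Universität Würzburg befindet sich in Bayern.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean pooling over non-padding tokens
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentences
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.3f}")
```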
### Performance
We evaluate our models across a broad range of tasks. For natural language understanding, we use the [SuperGLEBer](https://lsx-uniwue.github.io/SuperGLEBer-site/) benchmark, and for embedding capabilities, the [German MTEB](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28deu%2C+v1%29) benchmark (after unsupervised fine-tuning of every model on the German portion of mMARCO). The following table compares this encoder with other German and multilingual encoders. See our [preprint](https://arxiv.org/abs/2505.13136) for more details about the evaluation.

| Model                            | SuperGLEBer Avg | MTEB Avg  |
|----------------------------------|-----------------|-----------|
| ModernGBERT 1B<br>(you are here) | **0.808**       | **0.551** |
| ModernGBERT 134M                 | 0.749           | 0.501     |
| GBERT-base                       | 0.718           | 0.500     |
| GBERT-large                      | 0.768           | 0.521     |
| GeBERTa-base                     | 0.716           | 0.493     |
| GeBERTa-large                    | 0.749           | 0.494     |
| GeBERTa-xlarge                   | 0.767           | 0.521     |
| Gerturax-3                       | 0.740           | 0.472     |
| XLM-RoBERTa-large                | 0.730           | 0.460     |
| XLM-RoBERTa-xlarge               | 0.758           | 0.479     |
### License
We release the ModernGBERT models under a research-only RAIL-M license. See [license.md](./license.md) for details.