---
datasets:
- togethercomputer/RedPajama-Data-V2
language:
- de
library_name: transformers
license: other
pipeline_tag: feature-extraction
tags:
- masked-lm
- long-context
- modernbert
---
# ModernGBERT 1B
ModernGBERT 1B is a German ModernBERT language model with 1 billion parameters and a native context length of up to 8,192 tokens. It follows the architecture and training procedure of the ModernBERT [codebase](https://github.com/AnswerDotAI/ModernBERT).
ModernGBERT 1B has been pre-trained on the same 1.27 trillion tokens from the German portion of [RedPajama V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) as our [LLäMmlein](https://huggingface.co/collections/LSX-UniWue/llammlein-6732ff41f3705c686e605762) decoder family.
We provide two model sizes:
* [ModernGBERT 1B](https://huggingface.co/LSX-UniWue/ModernGBERT_1B) ← You are here<br>
  28 layers, hidden size 2,048, 1 billion parameters
* [ModernGBERT 134M](https://huggingface.co/LSX-UniWue/ModernGBERT_134M)<br>
  22 layers, hidden size 768, 134 million parameters
Find more details in our [preprint](https://arxiv.org/abs/2505.13136)!
### Usage
You can use ModernGBERT with the `transformers` library from version v4.48.0 onwards.
(Optional: install `flash-attn` for the highest efficiency.)
Since ModernGBERT 1B is a Masked Language Model (MLM), you can load it via `AutoModelForMaskedLM`. For downstream tasks such as classification, retrieval, or QA, fine-tune the model by following standard BERT fine-tuning recipes.
Example using `AutoModelForMaskedLM`:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Die Hauptstadt von Frankreich ist [MASK]."  # "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1).item()
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: Paris
```
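For downstream tasks, the same checkpoint can be loaded with a task-specific head. The following is a minimal sketch for sequence classification, assuming a hypothetical binary task (`num_labels=2`); the classification head is randomly initialized and still needs fine-tuning on labelled data:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Pre-trained encoder with a freshly initialized classification head;
# num_labels=2 is a hypothetical binary-classification setup.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Würzburg ist eine Stadt in Bayern.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, 2); meaningless until fine-tuned
```

From here, training proceeds as for any BERT-style encoder, e.g. with the `Trainer` API or a custom training loop.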
**NOTE:** If you want to use HuggingFace's PEFT library for LoRA training, you need to specify the target modules, e.g.:
```python
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    task_type="TOKEN_CLS", r=8, lora_alpha=32,
    target_modules=["Wqkv", "Wi", "Wo"],
)
model = get_peft_model(model, peft_config)
```
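After wrapping the model, PEFT's `print_trainable_parameters` helper can confirm that only the LoRA adapters (plus the task head) remain trainable:

```python
# Prints the number of trainable vs. total parameters of the PEFT-wrapped model.
model.print_trainable_parameters()
```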
### Intermediate Checkpoints
In addition to the final model checkpoint, we publish intermediate checkpoints from throughout the training process as separate branches of this repository.
A specific checkpoint can be loaded like this:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "LSX-UniWue/ModernGBERT_1B"
revision = "base-head-12000-ckpt"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision)
```
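The available checkpoint branches can be listed programmatically with the `huggingface_hub` library; a small sketch:

```python
from huggingface_hub import list_repo_refs

# Each intermediate checkpoint is published as its own branch and can be
# passed to from_pretrained(...) via the `revision` argument.
refs = list_repo_refs("LSX-UniWue/ModernGBERT_1B")
print(sorted(branch.name for branch in refs.branches))
```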
### Performance
We evaluate our models across a broad range of tasks. For natural language understanding, we use the [SuperGLEBer](https://lsx-uniwue.github.io/SuperGLEBer-site/) benchmark, and for embedding capabilities, we use the [German MTEB](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28deu%2C+v1%29) benchmark (after unsupervised fine-tuning of every model on the German portion of mMARCO). The following table compares this encoder with other German and multilingual encoders. See our [preprint](https://arxiv.org/abs/2505.13136) for more details about the evaluation.
| Model | SuperGLEBer Avg | MTEB Avg |
|----------------------------------|-----------------|-----------|
| ModernGBERT 1B<br>(you are here) | **0.808** | **0.551** |
| ModernGBERT 134M | 0.749 | 0.501 |
| GBERT-base | 0.718 | 0.500 |
| GBERT-large | 0.768 | 0.521 |
| GeBERTa-base | 0.716 | 0.493 |
| GeBERTa-large | 0.749 | 0.494 |
| GeBERTa-xlarge | 0.767 | 0.521 |
| Gerturax-3 | 0.740 | 0.472 |
| XLM-RoBERTa-large | 0.730 | 0.460 |
| XLM-RoBERTa-xlarge | 0.758 | 0.479 |
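For reference, embeddings like those used in the MTEB evaluation can be obtained by encoding text with the base model and pooling the hidden states. The sketch below uses mean pooling over non-padding tokens, a common default that does not necessarily match our exact evaluation setup:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["Würzburg liegt am Main.", "Die Stadt ist für ihren Wein bekannt."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean pooling over non-padding tokens yields one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 2048])
```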
### License
We release the ModernGBERT models under a research-only RAIL-M license. See [license.md](./license.md) for details.