|
--- |
|
datasets: |
|
- togethercomputer/RedPajama-Data-V2 |
|
language: |
|
- de |
|
library_name: transformers |
|
license: other |
|
pipeline_tag: feature-extraction |
|
tags: |
|
- masked-lm |
|
- long-context |
|
- modernbert |
|
--- |
|
|
|
# ModernGBERT 1B |
|
|
|
ModernGBERT 1B is a German ModernBERT language model with 1 billion parameters and a native context length of up to 8,192 tokens. The model follows the BERT-style architecture and training procedure of ModernBERT (see the ModernBERT [codebase](https://github.com/AnswerDotAI/ModernBERT)).
|
ModernGBERT 1B has been pre-trained on the same 1.27 trillion tokens from the German portion of [RedPajama V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) as our [LLäMmlein](https://huggingface.co/collections/LSX-UniWue/llammlein-6732ff41f3705c686e605762) decoder family. |
|
|
|
We provide two model sizes: |
|
|
|
* [ModernGBERT 1B](https://huggingface.co/LSX-UniWue/ModernGBERT_1B) ← You are here |
|
28 layers, hidden size 2,048, 1 billion parameters |
|
|
|
* [ModernGBERT 134M](https://huggingface.co/LSX-UniWue/ModernGBERT_134M) |
|
22 layers, hidden size 768, 134 million parameters |
|
|
|
Find more details in our [preprint](https://arxiv.org/abs/2505.13136)! |
|
|
|
|
|
### Usage |
|
|
|
You can use ModernGBERT with the `transformers` library starting from version 4.48.0.

(Optional: install `flash-attn` for maximum efficiency.)
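
For example, once `flash-attn` is installed you can request the FlashAttention backend explicitly when loading the model. This is a minimal sketch using the standard `attn_implementation` and `torch_dtype` arguments of `from_pretrained`; the default attention backend works as well.

```python
import torch
from transformers import AutoModelForMaskedLM

# FlashAttention 2 requires half precision (fp16 or bf16).
model = AutoModelForMaskedLM.from_pretrained(
    "LSX-UniWue/ModernGBERT_1B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```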
|
|
|
Since ModernGBERT 1B is a Masked Language Model (MLM), you can load it via `AutoModelForMaskedLM`. For downstream tasks such as classification, retrieval, or QA, fine-tune the model following standard BERT fine-tuning recipes (a minimal fine-tuning sketch is included at the end of this section).
|
|
|
Example using `AutoModelForMaskedLM`: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
model_id = "LSX-UniWue/ModernGBERT_1B" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
model = AutoModelForMaskedLM.from_pretrained(model_id) |
|
|
|
text = "Die Hauptstadt von Frankreich ist [MASK]." |
|
inputs = tokenizer(text, return_tensors="pt") |
|
outputs = model(**inputs) |
|
|
|
# To get predictions for the mask: |
|
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id) |
|
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
|
predicted_token = tokenizer.decode(predicted_token_id) |
|
print("Predicted token:", predicted_token) |
|
# Predicted token: Paris |
|
``` |
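
The same prediction can also be obtained with the `fill-mask` pipeline, which handles mask lookup and decoding for you. This is a short sketch; the pipeline returns a list of candidate tokens with scores.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="LSX-UniWue/ModernGBERT_1B")

for prediction in fill_mask("Die Hauptstadt von Frankreich ist [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```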
|
|
|
**NOTE:** If you want to use Hugging Face's PEFT library for LoRA training, you need to specify the target modules explicitly, e.g.:
|
|
|
```python |
|
from peft import LoraConfig, get_peft_model |
|
peft_config = LoraConfig(
    task_type="TOKEN_CLS", r=8, lora_alpha=32,
    target_modules=["Wqkv", "Wi", "Wo"],
)
|
model = get_peft_model(model, peft_config) |
|
``` |
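
As noted above, downstream use follows standard BERT fine-tuning recipes. The following is a minimal sketch of sequence classification fine-tuning with the Hugging Face `Trainer`; the toy dataset, number of labels, and training arguments are placeholders for illustration, not recommendations.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels is task-specific; binary classification is only an example.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy dataset purely for illustration; replace with your own data.
train_data = Dataset.from_dict({
    "text": ["Das Essen war hervorragend.", "Der Service war enttäuschend."],
    "label": [1, 0],
})
train_data = train_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=8192),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="moderngbert-cls", num_train_epochs=1),
    train_dataset=train_data,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```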
|
|
|
### Intermediate Checkpoints |
|
In addition to the final model checkpoint, we publish intermediate checkpoints from throughout the training process as separate branches in this repository.
|
A specific checkpoint can be loaded like this: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
model_id = "LSX-UniWue/ModernGBERT_1B" |
|
revision = "base-head-12000-ckpt" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision) |
|
model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision) |
|
``` |
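
To see which checkpoint branches are available, you can list the repository's refs with the `huggingface_hub` client (a small sketch):

```python
from huggingface_hub import list_repo_refs

# Lists all branches of the model repository, including the
# intermediate-checkpoint branches.
refs = list_repo_refs("LSX-UniWue/ModernGBERT_1B")
print(sorted(branch.name for branch in refs.branches))
```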
|
|
|
### Performance |
|
We evaluate our models across a broad range of tasks. For natural language understanding, we use the [SuperGLEBer](https://lsx-uniwue.github.io/SuperGLEBer-site/) benchmark, and for embedding capabilities, we use the [German MTEB](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28deu%2C+v1%29) benchmark (after unsupervised fine-tuning of every model on the German portion of mMARCO). The following table compares this encoder with other German and multilingual encoders. See our [preprint](https://arxiv.org/abs/2505.13136) for more details about the evaluation.
|
|
|
| Model | SuperGLEBer Avg | MTEB Avg | |
|
|----------------------------------|-----------------|-----------| |
|
| ModernGBERT 1B<br>(you are here) | **0.808** | **0.551** | |
|
| ModernGBERT 134M | 0.749 | 0.501 | |
|
| GBERT-base | 0.718 | 0.500 | |
|
| GBERT-large | 0.768 | 0.521 | |
|
| GeBERTa-base | 0.716 | 0.493 | |
|
| GeBERTa-large | 0.749 | 0.494 | |
|
| GeBERTa-xlarge | 0.767 | 0.521 | |
|
| Gerturax-3 | 0.740 | 0.472 | |
|
| XLM-RoBERTa-large | 0.730 | 0.460 | |
|
| XLM-RoBERTa-xlarge | 0.758 | 0.479 | |
|
|
|
|
|
|
|
### License |
|
|
|
We release the ModernGBERT models under a research-only RAIL-M license. See [license.md](./license.md) for details. |
|
|