---
datasets:
- togethercomputer/RedPajama-Data-V2
language:
- de
library_name: transformers
license: other
pipeline_tag: feature-extraction
tags:
- masked-lm
- long-context
- modernbert
---

# ModernGBERT 1B

ModernGBERT 1B is a German ModernBERT language model with 1 billion parameters and a native context length of up to 8,192 tokens. It follows the same BERT-style architecture and training procedure as the ModernBERT [codebase](https://github.com/AnswerDotAI/ModernBERT). ModernGBERT 1B has been pre-trained on the same 1.27 trillion tokens from the German portion of [RedPajama V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) as our [LLäMmlein](https://huggingface.co/collections/LSX-UniWue/llammlein-6732ff41f3705c686e605762) decoder family.

We provide two model sizes:

* [ModernGBERT 1B](https://huggingface.co/LSX-UniWue/ModernGBERT_1B) ← you are here: 28 layers, hidden size 2,048, 1 billion parameters
* [ModernGBERT 134M](https://huggingface.co/LSX-UniWue/ModernGBERT_134M): 22 layers, hidden size 768, 134 million parameters

Find more details in our [preprint](https://arxiv.org/abs/2505.13136)!

### Usage

You can use ModernGBERT with the `transformers` library from version 4.48.0 onwards. (Optional: install `flash-attn` for the highest efficiency.)

Since ModernGBERT 1B is a Masked Language Model (MLM), you can load it via `AutoModelForMaskedLM`. For downstream tasks such as classification, retrieval, or QA, fine-tune the model following standard BERT fine-tuning recipes.

Example using `AutoModelForMaskedLM`:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Die Hauptstadt von Frankreich ist [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: Paris
```

**NOTE:** If you want to use Hugging Face's PEFT library for LoRA training, you need to specify the target modules, e.g.:

```python
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    task_type="TOKEN_CLS",
    r=8,
    lora_alpha=32,
    target_modules=["Wqkv", "Wi", "Wo"],
)
model = get_peft_model(model, peft_config)
```

### Intermediate Checkpoints

In addition to the final model checkpoint, we publish intermediate checkpoints from throughout the full training process as unique branches in this repository. A specific checkpoint can be loaded like this:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "LSX-UniWue/ModernGBERT_1B"
revision = "base-head-12000-ckpt"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision)
```

### Performance

We evaluate our models across a broad range of tasks. For natural language understanding, we use the [SuperGLEBer](https://lsx-uniwue.github.io/SuperGLEBer-site/) benchmark, and for embedding capabilities, we use the [German MTEB](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28deu%2C+v1%29) benchmark (after unsupervised fine-tuning of every model on the German portion of mMARCO).
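For illustration, here is a minimal sketch of how sentence embeddings can be obtained from the plain encoder via mean pooling over the last hidden state. The pooling strategy and example sentences are illustrative choices only and do not necessarily match the setup used in our MTEB evaluation (which additionally involved unsupervised fine-tuning on mMARCO):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # plain encoder, no MLM head
model.eval()

sentences = [
    "Die Hauptstadt von Frankreich ist Paris.",
    "Berlin ist die Hauptstadt von Deutschland.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean pooling over non-padding tokens
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings
print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0).item())
```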
The following table provides a comparison of this encoder with other German and multilingual encoders. See our [preprint](https://arxiv.org/abs/2505.13136) for more details about the evaluation.

| Model | SuperGLEBer Avg | MTEB Avg |
|--------------------------------|-----------|-----------|
| ModernGBERT 1B (you are here) | **0.808** | **0.551** |
| ModernGBERT 134M | 0.749 | 0.501 |
| GBERT-base | 0.718 | 0.500 |
| GBERT-large | 0.768 | 0.521 |
| GeBERTa-base | 0.716 | 0.493 |
| GeBERTa-large | 0.749 | 0.494 |
| GeBERTa-xlarge | 0.767 | 0.521 |
| Gerturax-3 | 0.740 | 0.472 |
| XLM-RoBERTa-large | 0.730 | 0.460 |
| XLM-RoBERTa-xlarge | 0.758 | 0.479 |

### License

We release the ModernGBERT models under a research-only RAIL-M license. See [license.md](./license.md) for details.