|
--- |
|
datasets: |
|
- togethercomputer/RedPajama-Data-V2 |
|
language: |
|
- de |
|
library_name: transformers |
|
license: other |
|
pipeline_tag: feature-extraction |
|
tags: |
|
- masked-lm |
|
- long-context |
|
- modernbert |
|
--- |
|
|
|
# ModernGBERT 1B |
|
|
|
ModernGBERT 1B is a German ModernBERT language model with 1 billion parameters and a native context length of up to 8,192 tokens. The model follows the BERT-style architecture and training procedure of ModernBERT (see the ModernBERT [codebase](https://github.com/AnswerDotAI/ModernBERT)).
|
ModernGBERT 1B has been pre-trained on the same 1.27 trillion tokens from the German portion of [RedPajama V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) as our [LLäMmlein](https://huggingface.co/collections/LSX-UniWue/llammlein-6732ff41f3705c686e605762) decoder family. |
|
|
|
We provide two model sizes: |
|
|
|
* [ModernGBERT 1B](https://huggingface.co/LSX-UniWue/ModernGBERT_1B) ← You are here |
|
28 layers, hidden size 2,048, 1 billion parameters |
|
|
|
* [ModernGBERT 134M](https://huggingface.co/LSX-UniWue/ModernGBERT_134M) |
|
22 layers, hidden size 768, 134 million parameters |
|
|
|
Find more details in our [preprint](https://arxiv.org/abs/2505.13136)! |
|
|
|
|
|
### Usage |
|
|
|
You can use ModernGBERT with the `transformers` library starting from version 4.48.0.

(Optional: install `flash-attn` for maximum efficiency.)
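
For example, once `flash-attn` is installed you can request the FlashAttention backend explicitly when loading the model. This is a minimal sketch using the standard `attn_implementation` and `torch_dtype` arguments of `from_pretrained`; the default attention backend works as well.

```python
import torch
from transformers import AutoModelForMaskedLM

# FlashAttention 2 requires half precision (fp16 or bf16).
model = AutoModelForMaskedLM.from_pretrained(
    "LSX-UniWue/ModernGBERT_1B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```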
|
|
|
Since ModernGBERT 1B is a Masked Language Model (MLM), you can load it via `AutoModelForMaskedLM`. For downstream tasks such as classification, retrieval, or QA, fine-tune the model following standard BERT fine-tuning recipes (a minimal fine-tuning sketch is included at the end of this section).
|
|
|
Example using `AutoModelForMaskedLM`: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
model_id = "LSX-UniWue/ModernGBERT_1B" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
model = AutoModelForMaskedLM.from_pretrained(model_id) |
|
|
|
text = "Die Hauptstadt von Frankreich ist [MASK]." |
|
inputs = tokenizer(text, return_tensors="pt") |
|
outputs = model(**inputs) |
|
|
|
# To get predictions for the mask: |
|
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id) |
|
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
|
predicted_token = tokenizer.decode(predicted_token_id) |
|
print("Predicted token:", predicted_token) |
|
# Predicted token: Paris |
|
``` |
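
The same prediction can also be obtained with the `fill-mask` pipeline, which handles mask lookup and decoding for you. This is a short sketch; the pipeline returns a list of candidate tokens with scores.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="LSX-UniWue/ModernGBERT_1B")

for prediction in fill_mask("Die Hauptstadt von Frankreich ist [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```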
|
|
|
**NOTE:** If you want to use Hugging Face's PEFT library for LoRA training, you need to specify the target modules explicitly, e.g.:
|
|
|
```python |
|
from peft import LoraConfig, get_peft_model |
|
peft_config = LoraConfig(
    task_type="TOKEN_CLS", r=8, lora_alpha=32,
    target_modules=["Wqkv", "Wi", "Wo"],
)
|
model = get_peft_model(model, peft_config) |
|
``` |
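
As noted above, downstream use follows standard BERT fine-tuning recipes. The following is a minimal sketch of sequence classification fine-tuning with the Hugging Face `Trainer`; the toy dataset, number of labels, and training arguments are placeholders for illustration, not recommendations.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels is task-specific; binary classification is only an example.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy dataset purely for illustration; replace with your own data.
train_data = Dataset.from_dict({
    "text": ["Das Essen war hervorragend.", "Der Service war enttäuschend."],
    "label": [1, 0],
})
train_data = train_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=8192),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="moderngbert-cls", num_train_epochs=1),
    train_dataset=train_data,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```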
|
|
|
### Intermediate Checkpoints |
|
In addition to the final model checkpoint, we publish intermediate checkpoints from throughout the training process as separate branches in this repository.
|
A specific checkpoint can be loaded like this: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
model_id = "LSX-UniWue/ModernGBERT_1B" |
|
revision = "base-head-12000-ckpt" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision) |
|
model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision) |
|
``` |
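
To see which checkpoint branches are available, you can list the repository's refs with the `huggingface_hub` client (a small sketch):

```python
from huggingface_hub import list_repo_refs

# Lists all branches of the model repository, including the
# intermediate-checkpoint branches.
refs = list_repo_refs("LSX-UniWue/ModernGBERT_1B")
print(sorted(branch.name for branch in refs.branches))
```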
|
|
|
### Performance |
|
We evaluate our models across a broad range of tasks. For natural language understanding, we use the [SuperGLEBer](https://lsx-uniwue.github.io/SuperGLEBer-site/) benchmark, and for embedding capabilities, we use the [German MTEB](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28deu%2C+v1%29) benchmark (after unsupervised fine-tuning of every model on the German portion of mMARCO). The following table compares this encoder with other German and multilingual encoders. See our [preprint](https://arxiv.org/abs/2505.13136) for more details about the evaluation.
|
|
|
| Model | SuperGLEBer Avg | MTEB Avg | |
|
|----------------------------------|-----------------|-----------| |
|
| ModernGBERT 1B<br>(you are here) | **0.808** | **0.551** | |
|
| ModernGBERT 134M | 0.749 | 0.501 | |
|
| GBERT-base | 0.718 | 0.500 | |
|
| GBERT-large | 0.768 | 0.521 | |
|
| GeBERTa-base | 0.716 | 0.493 | |
|
| GeBERTa-large | 0.749 | 0.494 | |
|
| GeBERTa-xlarge | 0.767 | 0.521 | |
|
| Gerturax-3 | 0.740 | 0.472 | |
|
| XLM-RoBERTa-large | 0.730 | 0.460 | |
|
| XLM-RoBERTa-xlarge | 0.758 | 0.479 | |
|
|
|
|
|
|
|
### License |
|
|
|
We release the ModernGBERT models under a research-only RAIL-M license. See [license.md](./license.md) for details. |
|
|