πŸ‡©πŸ‡ͺ GermaNER: Adapter-Based NER for German using XLM-RoBERTa


πŸ” Overview

GermaNER is a high-performance Named Entity Recognition (NER) model built on top of xlm-roberta-large and fine-tuned with the PEFT framework using LoRA adapters. It predicts 7 labels covering 3 entity types (person, organization, location) in the BIO tagging scheme and is optimized for both in-domain and general-domain German texts.

This model is lightweight (adapter-only) and requires attaching the LoRA adapter to the base model for inference.


🧠 Architecture

  • Base model: xlm-roberta-large
  • Fine-tuning: Parameter-Efficient Fine-Tuning (PEFT) using LoRA
  • Adapter config:
    • r=16, alpha=32, dropout=0.1
    • LoRA applied to: query, key, value projection layers
  • Max sequence length: 128 tokens
  • Mixed-precision training: fp16
  • Training samples: 44,000 sentences
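
For reproducibility, here is a minimal sketch of the adapter configuration in peft. The hyperparameters are taken from the list above; the task type and the exact target-module names are assumptions based on standard XLM-RoBERTa layer naming.

from peft import LoraConfig, TaskType

# Hypothetical reconstruction of the adapter config from the values above.
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,              # token-classification head
    r=16,                                      # LoRA rank
    lora_alpha=32,                             # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],  # self-attention projections
)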

🏷️ Label Schema

The model uses the standard BIO format with the following 7 labels:

Label   Description
O       Outside any named entity
B-PER   Beginning of a person entity
I-PER   Inside a person entity
B-ORG   Beginning of an organization entity
I-ORG   Inside an organization entity
B-LOC   Beginning of a location entity
I-LOC   Inside a location entity
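
To illustrate the scheme, a sentence such as "Angela Merkel besuchte Siemens in München." would be tagged token by token as follows (an illustrative hand-labelled example, not model output):

Angela    B-PER
Merkel    I-PER
besuchte  O
Siemens   B-ORG
in        O
München   B-LOC
.         O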

πŸ—‚οΈ Training-Set Concatenation

The model was trained on a concatenated corpus of GermEval 2014 and WikiANN-de:

Split        Sentences
Training     44,000
Evaluation   15,100

The datasets were token-aligned to the BIO scheme and merged before shuffling, ensuring a balanced distribution of domain-specific (news & Wikipedia) entity mentions across both splits.
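
A minimal sketch of this concatenation step with the datasets library, assuming the Hub identifiers germeval_14 and wikiann (config "de"); the exact preprocessing used for GermaNER may differ.

from datasets import load_dataset, concatenate_datasets

# Assumed Hub identifiers; the real preprocessing may differ.
germeval = load_dataset("germeval_14", split="train")
wikiann = load_dataset("wikiann", "de", split="train")

# concatenate_datasets requires identical schemas, so both corpora are
# first reduced to (tokens, ner_tags). GermEval 2014 also ships
# fine-grained tags (e.g. B-OTH, B-LOCderiv) that must be remapped or
# dropped to match the 7-label scheme above (remapping omitted here).
germeval = germeval.select_columns(["tokens", "ner_tags"])
wikiann = wikiann.select_columns(["tokens", "ner_tags"])

merged = concatenate_datasets([germeval, wikiann]).shuffle(seed=42)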

πŸš€ Getting Started

This repository ships only the LoRA adapter weights, not a full model. Use peft to attach the adapter to the base model before inference.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel, PeftConfig

model_id = "fau/GermaNER"

# Define label mappings
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for idx, label in enumerate(label_names)}

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# Load PEFT adapter config
peft_config = PeftConfig.from_pretrained(model_id, token=True)

# Load base model with label mappings
base_model = AutoModelForTokenClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
    token=True
)

# Attach adapter
model = PeftModel.from_pretrained(base_model, model_id, token=True)

# Create pipeline
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Run inference
text = "Angela Merkel war Bundeskanzlerin von Deutschland."
entities = ner_pipe(text)

for ent in entities:
    print(f"{ent['word']} β†’ {ent['entity_group']} (score: {ent['score']:.2f})")

πŸ“ Files & Structure

File                       Description
adapter_model.safetensors  LoRA adapter weights
adapter_config.json        PEFT configuration for the adapter
tokenizer.json             Tokenizer for XLM-RoBERTa
sentencepiece.bpe.model    SentencePiece model file
special_tokens_map.json    Special-tokens configuration
tokenizer_config.json      Tokenizer settings

πŸ’‘ Open-Source Use Cases (Hugging Face)

  • Streaming news pipelines – Deploy transformers NER via the pipeline("ner") API inside a Kafka β†’ Faust stream processor. Emit annotated JSON to OpenSearch/Elastic and visualise it in Kibana dashboards, all built from OSS components.

  • Parliament analytics – Load Bundestag & LΓ€nder transcripts with datasets.load_dataset, tag entities in batch with a TokenClassificationPipeline (see the batch-tagging sketch after this list), then export triples to Neo4j via the OSS graphdatascience driver and expose them through a GraphQL layer.

  • Biomedical text mining – Ingest open German clinical-trial registries (e.g. from the Hugging Face Hub) into Spark; call the NER model on RDD partitions to extract drug-gene-disease mentions, feeding a downstream pharmacovigilance workflow, entirely with Apache-licensed libraries.

  • Conversational AI – Attach the LoRA adapter with PeftModel and serve it behind a self-hosted inference API. Connect to Rasa 3 (open source) via a custom NLU component for real-time slot-filling and context hand-off in German customer-support chatbots.
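
As referenced in the parliament-analytics idea above, here is a minimal batch-tagging sketch. ner_pipe is the pipeline built in the Getting Started section, and the transcript sentences are placeholders.

# ner_pipe is the pipeline created in "Getting Started" above.
# The sentences below are placeholders; in practice they would come from
# datasets.load_dataset on a Bundestag transcript corpus.
transcripts = [
    "Der Bundestag debattierte über den Haushalt.",
    "Olaf Scholz sprach in Berlin.",
]

# Pipelines accept a list of texts plus an internal batch size for throughput.
for sentence, entities in zip(transcripts, ner_pipe(transcripts, batch_size=32)):
    triples = [(ent["word"], ent["entity_group"], sentence) for ent in entities]
    print(triples)  # e.g. stream these (entity, type, context) rows into Neo4j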

πŸ“œ License
This model is licensed under the Apache 2.0 License.

For questions, reach out on GitHub or Hugging Face 🀝

Open source contributions are welcome via:

  • A demo.ipynb notebook
  • An evaluation script using seqeval (see the sketch below)
  • A gr.Interface or Streamlit demo for public inference
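
For the seqeval item above, a minimal evaluation sketch, assuming gold and predicted tags are available as lists of BIO strings per sentence:

from seqeval.metrics import classification_report, f1_score

# Illustrative tag sequences; real ones would come from running the
# pipeline over the evaluation split and aligning predictions to tokens.
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]

print(f"F1: {f1_score(y_true, y_pred):.4f}")
print(classification_report(y_true, y_pred))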