# 🇩🇪 GermaNER: Adapter-Based NER for German using XLM-RoBERTa

## 📖 Overview
GermaNER is a Named Entity Recognition (NER) model built on top of `xlm-roberta-large` and fine-tuned with the PEFT framework using LoRA adapters. It predicts 7 BIO labels covering three entity types (person, organization, location) and is suited to both in-domain and general-domain German text.
The model is lightweight (adapter-only): inference requires attaching the LoRA adapter to the base model, as shown in Getting Started below.
## 🧠 Architecture
- Base model: `xlm-roberta-large`
- Fine-tuning: Parameter-Efficient Fine-Tuning (PEFT) with LoRA
- Adapter config: `r=16`, `alpha=32`, `dropout=0.1`
- LoRA applied to: the `query`, `key`, and `value` projection layers
- Max sequence length: 128 tokens
- Mixed-precision training (`fp16`)
- Training samples: 44,000 sentences
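For orientation, the settings above map onto `peft`'s `LoraConfig` roughly as follows. This is a sketch, not the exact training configuration; the authoritative values ship in `adapter_config.json`.

```python
from peft import LoraConfig, TaskType

# Sketch of the adapter configuration described above; the exact values
# used for training are stored in adapter_config.json.
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,               # token-classification head
    r=16,                                       # LoRA rank
    lora_alpha=32,                              # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],   # attention projections
)
```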
## 🏷️ Label Schema
The model uses the standard BIO format with the following 7 labels:

| Label | Description |
|---|---|
| O | Outside any named entity |
| B-PER | Beginning of a person entity |
| I-PER | Inside a person entity |
| B-ORG | Beginning of an organization |
| I-ORG | Inside an organization |
| B-LOC | Beginning of a location entity |
| I-LOC | Inside a location entity |
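For illustration, a German sentence tagged under this scheme looks like this (a hypothetical hand-labeled example, not model output):

```python
# One tag per word; multi-word entities continue with I-* tags.
tokens = ["Angela", "Merkel", "traf", "Vertreter", "der", "Siemens", "AG",    "in", "Berlin", "."]
tags   = ["B-PER",  "I-PER",  "O",    "O",         "O",   "B-ORG",   "I-ORG", "O",  "B-LOC",  "O"]
```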
## 🗂️ Training-Set Concatenation
The model was trained on a concatenated corpus of GermEval 2014 and WikiANN-de:

| Split | Sentences |
|---|---|
| Training | 44,000 |
| Evaluation | 15,100 |

The datasets were token-aligned to the BIO scheme and merged before shuffling, so that entity mentions from both domains (GermEval news, WikiANN Wikipedia) are distributed evenly across the two splits.
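The subword/label alignment follows the standard `transformers` token-classification recipe. A minimal sketch, assuming whitespace-tokenized examples with integer `ner_tags`; this is not the released training script:

```python
def tokenize_and_align_labels(examples, tokenizer, max_length=128):
    """Align word-level BIO tags to XLM-RoBERTa subword tokens."""
    tokenized = tokenizer(
        examples["tokens"], is_split_into_words=True,
        truncation=True, max_length=max_length,
    )
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, label_ids = None, []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)           # special tokens: ignored by the loss
            elif word_id != previous:
                label_ids.append(tags[word_id])  # first subword carries the tag
            else:
                label_ids.append(-100)           # remaining subwords are masked
            previous = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized
```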
## 🚀 Getting Started
This repository ships adapter weights, not a full model. Use `peft` to attach the adapter to the base model:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel, PeftConfig

model_id = "fau/GermaNER"

# Define label mappings
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for idx, label in enumerate(label_names)}

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# Load the PEFT adapter config to resolve the base model
peft_config = PeftConfig.from_pretrained(model_id, token=True)

# Load the base model with label mappings
base_model = AutoModelForTokenClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
    token=True,
)

# Attach the adapter
model = PeftModel.from_pretrained(base_model, model_id, token=True)

# Create an NER pipeline that merges subwords into entity spans
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Run inference
text = "Angela Merkel war Bundeskanzlerin von Deutschland."
entities = ner_pipe(text)
for ent in entities:
    print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
```
## Files & Structure

| File | Description |
|---|---|
| adapter_model.safetensors | LoRA adapter weights |
| adapter_config.json | PEFT config for the adapter |
| tokenizer.json | Tokenizer for XLM-RoBERTa |
| sentencepiece.bpe.model | SentencePiece model file |
| special_tokens_map.json | Special tokens config |
| tokenizer_config.json | Tokenizer settings |
## 💡 Open-Source Use Cases (Hugging Face)
- **Streaming news pipelines**: Deploy `transformers` NER via the `pipeline("ner")` API inside a Kafka → Faust stream processor. Emit annotated JSON to OpenSearch/Elastic and visualise it in Kibana dashboards, all built from OSS components.
- **Parliament analytics**: Load Bundestag and Länder transcripts with `datasets.load_dataset`, tag entities in batch with a `TokenClassificationPipeline` (see the sketch after this list), then export triples to Neo4j via the OSS `graphdatascience` driver and expose them through a GraphQL layer.
- **Biomedical text mining**: Ingest open German clinical-trial registries (e.g. from the Hugging Face Hub) into Spark; call the NER model on RDD partitions to extract drug-gene-disease mentions, feeding a downstream pharmacovigilance workflow, entirely with Apache-licensed libraries.
- **Conversational AI**: Attach the LoRA adapter with `PeftModel` and serve it behind an HTTP inference endpoint, then connect it to Rasa 3 (open source) for real-time slot-filling and context hand-off in German customer-support chatbots.
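As a starting point for the batch-tagging workflows above, here is a minimal sketch that reuses the `ner_pipe` built in Getting Started; the sentences are toy placeholders, not a real transcript corpus:

```python
# Batch inference over a list of transcript sentences (toy examples).
sentences = [
    "Der Bundestag debattierte über den Haushalt.",
    "Olaf Scholz sprach in Berlin über die Siemens AG.",
]
for sentence, entities in zip(sentences, ner_pipe(sentences, batch_size=8)):
    print(sentence, "->", [(e["word"], e["entity_group"]) for e in entities])
```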
## 📄 License
This model is licensed under the Apache 2.0 License.
For questions, reach out on GitHub or Hugging Face 🤗
Open-source contributions are welcome via:
- A `demo.ipynb` notebook
- An evaluation script using `seqeval` (a minimal sketch follows this list)
- A `gr.Interface` or Streamlit demo for public inference
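For the evaluation script, a minimal `seqeval` sketch with toy gold/predicted sequences, shown only to illustrate the expected input format:

```python
from seqeval.metrics import classification_report, f1_score

# seqeval expects lists of BIO tag sequences, one list per sentence (toy data).
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "O"]]

print(f"F1: {f1_score(y_true, y_pred):.3f}")
print(classification_report(y_true, y_pred))
```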