πŸ‡©πŸ‡ͺ GermaNER: Adapter-Based NER for German using XLM-RoBERTa


πŸ” Overview

GermaNER is a high-performance Named Entity Recognition (NER) model built on top of xlm-roberta-large and fine-tuned with the PEFT framework using LoRA adapters. It predicts 7 labels covering 3 entity types (person, organization, location) in the BIO tagging scheme and is optimized for both in-domain and general-domain German texts.

This model is lightweight (adapter-only) and requires attaching the LoRA adapter to the base model for inference.


🧠 Architecture

  • Base model: xlm-roberta-large
  • Fine-tuning: Parameter-Efficient Fine-Tuning (PEFT) using LoRA
  • Adapter config:
    • r=16, alpha=32, dropout=0.1
    • LoRA applied to: query, key, value projection layers
  • Max sequence length: 128 tokens
  • Mixed-precision training: fp16
  • Training samples: 44,000 sentences
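
For reproducibility, here is a minimal sketch of the adapter configuration in peft. The hyperparameters are taken from the list above; the task type and the exact target-module names are assumptions based on standard XLM-RoBERTa layer naming.

from peft import LoraConfig, TaskType

# Hypothetical reconstruction of the adapter config from the values above.
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,              # token-classification head
    r=16,                                      # LoRA rank
    lora_alpha=32,                             # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],  # self-attention projections
)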

🏷️ Label Schema

The model uses the standard BIO format with the following 7 labels:

Label   Description
O       Outside any named entity
B-PER   Beginning of a person entity
I-PER   Inside a person entity
B-ORG   Beginning of an organization entity
I-ORG   Inside an organization entity
B-LOC   Beginning of a location entity
I-LOC   Inside a location entity
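
To illustrate the scheme, a sentence such as "Angela Merkel besuchte Siemens in München." would be tagged token by token as follows (an illustrative hand-labelled example, not model output):

Angela    B-PER
Merkel    I-PER
besuchte  O
Siemens   B-ORG
in        O
München   B-LOC
.         O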

πŸ—‚οΈ Training-Set Concatenation

The model was trained on a concatenated corpus of GermEval 2014 and WikiANN-de:

Split        Sentences
Training     44,000
Evaluation   15,100

The datasets were token-aligned to the BIO scheme and merged before shuffling, ensuring a balanced distribution of domain-specific (news & Wikipedia) entity mentions across both splits.
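
A minimal sketch of this concatenation step with the datasets library, assuming the Hub identifiers germeval_14 and wikiann (config "de"); the exact preprocessing used for GermaNER may differ.

from datasets import load_dataset, concatenate_datasets

# Assumed Hub identifiers; the real preprocessing may differ.
germeval = load_dataset("germeval_14", split="train")
wikiann = load_dataset("wikiann", "de", split="train")

# concatenate_datasets requires identical schemas, so both corpora are
# first reduced to (tokens, ner_tags). GermEval 2014 also ships
# fine-grained tags (e.g. B-OTH, B-LOCderiv) that must be remapped or
# dropped to match the 7-label scheme above (remapping omitted here).
germeval = germeval.select_columns(["tokens", "ner_tags"])
wikiann = wikiann.select_columns(["tokens", "ner_tags"])

merged = concatenate_datasets([germeval, wikiann]).shuffle(seed=42)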

πŸš€ Getting Started

This repository ships only the LoRA adapter weights, not a full model. Use peft to attach the adapter to the base model before inference.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel, PeftConfig

model_id = "fau/GermaNER"

# Define label mappings
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for idx, label in enumerate(label_names)}

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# Load PEFT adapter config
peft_config = PeftConfig.from_pretrained(model_id, token=True)

# Load base model with label mappings
base_model = AutoModelForTokenClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
    token=True
)

# Attach adapter
model = PeftModel.from_pretrained(base_model, model_id, token=True)

# Create pipeline
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Run inference
text = "Angela Merkel war Bundeskanzlerin von Deutschland."
entities = ner_pipe(text)

for ent in entities:
    print(f"{ent['word']} β†’ {ent['entity_group']} (score: {ent['score']:.2f})")

πŸ“ Files & Structure

File                       Description
adapter_model.safetensors  LoRA adapter weights
adapter_config.json        PEFT configuration for the adapter
tokenizer.json             Tokenizer for XLM-RoBERTa
sentencepiece.bpe.model    SentencePiece model file
special_tokens_map.json    Special-tokens configuration
tokenizer_config.json      Tokenizer settings

πŸ’‘ Open-Source Use Cases (Hugging Face)

  • Streaming news pipelines – Deploy transformers NER via the pipeline("ner") API inside a Kafka β†’ Faust stream processor. Emit annotated JSON to OpenSearch/Elastic and visualise it in Kibana dashboards, all built from OSS components.

  • Parliament analytics – Load Bundestag & LΓ€nder transcripts with datasets.load_dataset, tag entities in batch with a TokenClassificationPipeline (see the batch-tagging sketch after this list), then export triples to Neo4j via the OSS graphdatascience driver and expose them through a GraphQL layer.

  • Biomedical text mining – Ingest open German clinical-trial registries (e.g. from the Hugging Face Hub) into Spark; call the NER model on RDD partitions to extract drug-gene-disease mentions, feeding a downstream pharmacovigilance workflow, entirely with Apache-licensed libraries.

  • Conversational AI – Attach the LoRA adapter with PeftModel and serve it behind a self-hosted inference API. Connect to Rasa 3 (open source) via a custom NLU component for real-time slot-filling and context hand-off in German customer-support chatbots.
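
As referenced in the parliament-analytics idea above, here is a minimal batch-tagging sketch. ner_pipe is the pipeline built in the Getting Started section, and the transcript sentences are placeholders.

# ner_pipe is the pipeline created in "Getting Started" above.
# The sentences below are placeholders; in practice they would come from
# datasets.load_dataset on a Bundestag transcript corpus.
transcripts = [
    "Der Bundestag debattierte über den Haushalt.",
    "Olaf Scholz sprach in Berlin.",
]

# Pipelines accept a list of texts plus an internal batch size for throughput.
for sentence, entities in zip(transcripts, ner_pipe(transcripts, batch_size=32)):
    triples = [(ent["word"], ent["entity_group"], sentence) for ent in entities]
    print(triples)  # e.g. stream these (entity, type, context) rows into Neo4j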

πŸ“œ License
This model is licensed under the Apache 2.0 License.

For questions, reach out on GitHub or Hugging Face 🀝

Open source contributions are welcome via:

  • A demo.ipynb notebook
  • An evaluation script using seqeval (see the sketch below)
  • A gr.Interface or Streamlit demo for public inference
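
For the seqeval item above, a minimal evaluation sketch, assuming gold and predicted tags are available as lists of BIO strings per sentence:

from seqeval.metrics import classification_report, f1_score

# Illustrative tag sequences; real ones would come from running the
# pipeline over the evaluation split and aligning predictions to tokens.
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]

print(f"F1: {f1_score(y_true, y_pred):.4f}")
print(classification_report(y_true, y_pred))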