# Kurdish NER with XLM-R
This is a fine-tuned `xlm-roberta-base` model for Named Entity Recognition (NER) in Kurmanji Kurdish. It was trained on a manually annotated dataset of over 8,000 sentences. The model identifies the following entity types:
- PER: Person
- LOC: Location
- ORG: Organization
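
Under the hood the checkpoint predicts tags over these three types; the exact label names (e.g. a BIO scheme) are an assumption here, and the authoritative mapping ships with the checkpoint and can be inspected directly:

```python
from transformers import AutoConfig

# Inspect the label inventory stored in the model config.
# Expected to be a BIO scheme over the three types above,
# e.g. O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG.
config = AutoConfig.from_pretrained("akam-ot/ku-ner-xlmr")
print(config.id2label)
```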
## Model Details
- Base model: `xlm-roberta-base` (270M parameters)
- Fine-tuning hyperparameters (see the `TrainingArguments` sketch after this list):
  - Epochs: 5
  - Batch size: 16
  - Max sequence length: 128 tokens
  - Optimizer: AdamW
  - Learning rate: 2e-5
  - Warmup steps: 500
  - Weight decay: 0.01
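
The hyperparameters above map directly onto Hugging Face `TrainingArguments`. A minimal reconstruction sketch, assuming the standard `Trainer` workflow (the original training script is not published, so the output directory and optimizer flag are illustrative):

```python
from transformers import TrainingArguments

# Fine-tuning settings reported above, expressed as TrainingArguments.
# The 128-token limit is applied at tokenization time via
# tokenizer(..., truncation=True, max_length=128).
args = TrainingArguments(
    output_dir="ku-ner-xlmr",        # illustrative output directory
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    optim="adamw_torch",             # AdamW optimizer
)
```

These arguments would then be passed to a `Trainer` together with the tokenized training and evaluation splits of the annotated dataset.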
## Intended Use
- Extract named entities from Kurmanji Kurdish text (news, social media, etc.)
- Aid in information extraction, digital humanities, and low-resource language research
## Evaluation Metrics

Test set: 1,630 sentences (≈26k tokens)
| Entity  | Precision | Recall | F1 Score |
|---------|-----------|--------|----------|
| PER     | 0.8719    | 0.8666 | 0.8692   |
| LOC     | 0.8817    | 0.8825 | 0.8821   |
| ORG     | 0.7280    | 0.7930 | 0.7591   |
| Overall | 0.8325    | 0.8511 | 0.8414   |
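
Entity-level scores of this kind are conventionally computed with span-based sequence-labeling metrics such as those in `seqeval`; whether the authors used that exact library is an assumption. A toy sketch of the computation:

```python
from seqeval.metrics import classification_report

# Toy gold and predicted BIO sequences, purely to illustrate how
# entity-level precision/recall/F1 are derived; the real test set
# contains 1,630 sentences.
y_true = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]]
y_pred = [["B-PER", "O", "O", "B-LOC", "O", "O"]]
print(classification_report(y_true, y_pred, digits=4))
```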
## Try it Online

**Streamlit Demo**: paste a sentence in Kurmanji Kurdish (Latin script) and explore the model's predictions in your browser.
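
A demo like this can be built in a few lines on top of the `transformers` pipeline. A minimal sketch (the actual demo's source is not part of this repository, so the layout and caching choices here are illustrative):

```python
# app.py - illustrative Streamlit front-end for the NER model
import streamlit as st
from transformers import pipeline

@st.cache_resource  # load the model once and reuse it across reruns
def load_ner():
    return pipeline(
        "ner", model="akam-ot/ku-ner-xlmr", aggregation_strategy="simple"
    )

st.title("Kurdish NER (Kurmanji)")
text = st.text_area("Enter a sentence in Kurmanji Kurdish (Latin script):")
if st.button("Analyze") and text:
    for ent in load_ner()(text):
        st.write(f"{ent['word']} → {ent['entity_group']} ({ent['score']:.2f})")
```

Run it locally with `streamlit run app.py`.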
## How to Use

You can also load and use the model via Hugging Face 🤗 Transformers:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
model_id = "akam-ot/ku-ner-xlmr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Create NER pipeline
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example sentence
sentence = "Navê min Hejar e û ez li Hewlêr dijîm."

# Run NER
results = ner(sentence)

# Display results
for ent in results:
    print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
```
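
The pipeline hides the token-level details: it tokenizes the sentence into subwords, runs a forward pass, takes the argmax over the per-token label logits, and merges subwords into entity spans because of `aggregation_strategy="simple"`. A sketch of the same steps done manually (no aggregation, so subword pieces are printed individually):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "akam-ot/ku-ner-xlmr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

inputs = tokenizer("Navê min Hejar e û ez li Hewlêr dijîm.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(token, model.config.id2label[pred.item()])
```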