## Model Description

This model is a fine-tuned version of xlm-roberta-base for named entity recognition (NER), trained on the English and Turkish PAN-X subsets of the google/xtreme dataset.
## Label Scheme

- O: the token does not belong to any entity.
- B-PER / I-PER: the token is the beginning of / inside a person entity.
- B-ORG / I-ORG: the token is the beginning of / inside an organization entity.
- B-LOC / I-LOC: the token is the beginning of / inside a location entity.
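As an illustrative sketch of the BIO scheme (hand-labeled, not model output), the sentence from the usage example below would be tagged like this:

```python
# Hand-labeled BIO tags for an example sentence (illustration, not model output).
tokens = ["Mustafa", "Kemal", "Atatürk", "1881", "yılında", "Selanik'te", "doğdu", "."]
labels = ["B-PER",   "I-PER", "I-PER",   "O",    "O",       "B-LOC",      "O",     "O"]
```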
## Evaluation

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| No log | 1.0 | 417 | 0.1159 | 0.9689 | 0.9042 | 0.9274 | 0.9157 |
| 0.0895 | 2.0 | 834 | 0.1148 | 0.9707 | 0.9185 | 0.9228 | 0.9207 |
| 0.0895 | 3.0 | 1251 | 0.1209 | 0.9714 | 0.9171 | 0.9311 | 0.9241 |
| 0.0485 | 4.0 | 1668 | 0.1222 | 0.9725 | 0.9212 | 0.9335 | 0.9273 |
Validation F1 improves steadily across epochs, reaching 0.9273 at epoch 4, while validation loss rises only slightly after epoch 2 (0.1148 to 0.1222), so the model generalizes well without significant overfitting.
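These span-level metrics are the kind computed by the seqeval library, which the Hugging Face token-classification examples commonly use (the card does not state the exact evaluation code). A minimal sketch, assuming seqeval is installed and using toy label sequences for illustration:

```python
from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy gold and predicted label sequences, one list per sentence (illustrative only).
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC", "O"]]

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```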
## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("mehmet0sahinn/xlm-roberta-base-cased-ner-turkish")
tokenizer = AutoTokenizer.from_pretrained("mehmet0sahinn/xlm-roberta-base-cased-ner-turkish")

# Build an NER pipeline that merges subword pieces into whole entities.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Mustafa Kemal Atatürk 1881 yılında Selanik'te doğdu."
ner_results = nlp(text)

for entity in ner_results:
    print(entity)
```
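With `aggregation_strategy="simple"`, subword pieces are merged back into whole words, and each result is a dict with `entity_group`, `score`, `word`, `start`, and `end` keys. For the sentence above, the output should look roughly like this (scores and exact spans are illustrative, not actual model output):

```python
# Illustrative output shape, not actual model output:
# {'entity_group': 'PER', 'score': 0.99, 'word': 'Mustafa Kemal Atatürk', 'start': 0, 'end': 21}
# {'entity_group': 'LOC', 'score': 0.99, 'word': 'Selanik', 'start': 35, 'end': 42}
```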
## Dataset

- Source: PAN-X subset of google/xtreme
- Languages: English, Turkish
- Training size: 20,000 (EN) + 20,000 (TR) rows
- Validation size: 10,000 (EN) + 10,000 (TR) rows
- Test size: 10,000 (EN) + 10,000 (TR) rows
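A minimal sketch of loading and combining the two language subsets with the datasets library; the exact preprocessing used to train this model is not documented here:

```python
from datasets import load_dataset, concatenate_datasets

# PAN-X configs in XTREME are named per language.
panx_en = load_dataset("xtreme", "PAN-X.en")
panx_tr = load_dataset("xtreme", "PAN-X.tr")

# Merge English and Turkish examples split by split.
train = concatenate_datasets([panx_en["train"], panx_tr["train"]])
validation = concatenate_datasets([panx_en["validation"], panx_tr["validation"]])
test = concatenate_datasets([panx_en["test"], panx_tr["test"]])

print(len(train), len(validation), len(test))  # 40000 20000 20000
```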
## Base Model

- FacebookAI/xlm-roberta-base