A Named Entity Recognition Model for Kazakh
How to use
You can use this model with the Transformers pipeline for NER.
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("yeshpanovrustem/xlm-roberta-large-ner-kazakh")
model = AutoModelForTokenClassification.from_pretrained("yeshpanovrustem/xlm-roberta-large-ner-kazakh")

# aggregation_strategy="none" returns one prediction per subword token
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="none")
example = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."

ner_results = nlp(example)
for result in ner_results:
    print(result)
```
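```python
# Merge subword pieces back into whole words.
# A piece that starts with "▁" opens a new word; the word keeps the label of its first piece.
token = ""
label_list = []
token_list = []

for result in ner_results:
    if result["word"].startswith("▁"):
        if token:
            token_list.append(token.replace("▁", ""))
        token = result["word"]
        label_list.append(result["entity"])
    else:
        token += result["word"]

# Flush the last word, if any tagged tokens were returned
if token:
    token_list.append(token.replace("▁", ""))

for token, label in zip(token_list, label_list):
    print(f"{token}\t{label}")
```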
```python
# aggregation_strategy="simple" lets the pipeline group subword tokens into entity spans
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
example = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."

ner_results = nlp(example)
for result in ner_results:
    print(result)
```
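Once entities are aggregated, a common next step is to collect them by type. The sketch below assumes each result is a dict with `entity_group`, `score`, `start`, and `end` keys, which is the shape the Transformers pipeline returns when an aggregation strategy is set. The helper name `group_entities`, the `min_score` threshold, and the sample list are illustrative assumptions; the sample is hand-written and is not real output of this model.

```python
from collections import defaultdict


def group_entities(text, results, min_score=0.0):
    """Group aggregated NER results by entity type.

    The surface string is sliced from `text` using the character offsets,
    so it does not depend on how the tokenizer normalised `word`.
    """
    grouped = defaultdict(list)
    for result in results:
        if result["score"] < min_score:
            continue  # drop low-confidence spans
        span = text[result["start"]:result["end"]]
        grouped[result["entity_group"]].append(span)
    return dict(grouped)


# Hand-written stand-in for `nlp(example)` output. Labels and scores are
# made up for illustration and are NOT real predictions of the model.
text = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."
sample_results = [
    {"entity_group": "GPE", "score": 0.99, "word": "Қазақстан Республикасы", "start": 0, "end": 22},
    {"entity_group": "LOCATION", "score": 0.40, "word": "Шығыс Еуропа", "start": 25, "end": 37},
]

print(group_entities(text, sample_results, min_score=0.5))
```

With the real pipeline, `sample_results` would be replaced by `ner_results` and `text` by `example`. The threshold of 0.5 is arbitrary and would need tuning for a given use case.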
Evaluation results on the validation and test sets
| Set | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| Validation | 96.58% | 96.66% | 96.62% |
| Test | 96.49% | 96.86% | 96.67% |
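The F1-score is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it from the precision and recall values reported above; the function name `f1_score` is just an illustrative helper, not part of any library used by this model card.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in the same unit)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above, in percent
reported = {
    "validation": (96.58, 96.66),
    "test": (96.49, 96.86),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}%")
# validation: F1 = 96.62%
# test: F1 = 96.67%
```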
Model performance for the NE classes of the validation set
| NE Class | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| ADAGE | 90.00% | 47.37% | 62.07% | 19 |
| ART | 91.36% | 95.48% | 93.38% | 155 |
| CARDINAL | 98.44% | 98.37% | 98.40% | 2,878 |
| CONTACT | 100.00% | 83.33% | 90.91% | 18 |
| DATE | 97.38% | 97.27% | 97.33% | 2,603 |
| DISEASE | 96.72% | 97.52% | 97.12% | 121 |
| EVENT | 83.24% | 93.51% | 88.07% | 154 |
| FACILITY | 68.95% | 84.83% | 76.07% | 178 |
| GPE | 98.46% | 96.50% | 97.47% | 1,656 |
| LANGUAGE | 95.45% | 89.36% | 92.31% | 47 |
| LAW | 87.50% | 87.50% | 87.50% | 56 |
| LOCATION | 92.49% | 93.81% | 93.14% | 210 |
| MISCELLANEOUS | 100.00% | 76.92% | 86.96% | 26 |
| MONEY | 99.56% | 100.00% | 99.78% | 455 |
| NON_HUMAN | 0.00% | 0.00% | 0.00% | 1 |
| NORP | 95.71% | 95.45% | 95.58% | 374 |
| ORDINAL | 98.14% | 95.84% | 96.98% | 385 |
| ORGANISATION | 92.19% | 90.97% | 91.58% | 753 |
| PERCENTAGE | 99.08% | 99.08% | 99.08% | 437 |
| PERSON | 98.47% | 98.72% | 98.60% | 1,175 |
| POSITION | 96.15% | 97.79% | 96.96% | 587 |
| PRODUCT | 89.06% | 78.08% | 83.21% | 73 |
| PROJECT | 92.13% | 95.22% | 93.65% | 209 |
| QUANTITY | 97.58% | 98.30% | 97.94% | 411 |
| TIME | 94.81% | 96.63% | 95.71% | 208 |
| micro avg | 96.58% | 96.66% | 96.62% | 13,189 |
| macro avg | 90.12% | 87.51% | 88.39% | 13,189 |
| weighted avg | 96.67% | 96.66% | 96.63% | 13,189 |
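The macro average in the table is noticeably lower than the micro and weighted averages. Assuming the usual definitions (macro is the unweighted mean over classes, weighted is the mean weighted by support), rare classes with low scores, such as NON_HUMAN with a support of 1, pull the macro average down while barely affecting the weighted one. The sketch below illustrates this with the recall of only three rows from the table, so its output does not reproduce the table's averages; the function names are illustrative helpers.

```python
def macro_average(rows):
    """Unweighted mean: every class counts equally."""
    return sum(value for value, _ in rows) / len(rows)


def weighted_average(rows):
    """Mean weighted by support: frequent classes dominate."""
    total_support = sum(support for _, support in rows)
    return sum(value * support for value, support in rows) / total_support


# (recall in percent, support) for three validation-set classes from the table above
rows = [
    (47.37, 19),    # ADAGE
    (96.50, 1656),  # GPE
    (0.00, 1),      # NON_HUMAN
]

print(f"macro:    {macro_average(rows):.2f}%")
print(f"weighted: {weighted_average(rows):.2f}%")
```

On this subset the macro average falls far below the weighted one, because two of the three classes are rare and poorly recalled.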
Model performance for the NE classes of the test set
| NE Class | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| ADAGE | 71.43% | 29.41% | 41.67% | 17 |
| ART | 95.71% | 96.89% | 96.30% | 161 |
| CARDINAL | 98.43% | 98.60% | 98.51% | 2,789 |
| CONTACT | 94.44% | 85.00% | 89.47% | 20 |
| DATE | 96.59% | 97.60% | 97.09% | 2,584 |
| DISEASE | 87.69% | 95.80% | 91.57% | 119 |
| EVENT | 86.67% | 92.86% | 89.66% | 154 |
| FACILITY | 74.88% | 81.73% | 78.16% | 197 |
| GPE | 98.57% | 97.81% | 98.19% | 1,691 |
| LANGUAGE | 90.70% | 95.12% | 92.86% | 41 |
| LAW | 93.33% | 76.36% | 84.00% | 55 |
| LOCATION | 92.08% | 89.42% | 90.73% | 208 |
| MISCELLANEOUS | 86.21% | 96.15% | 90.91% | 26 |
| MONEY | 100.00% | 100.00% | 100.00% | 427 |
| NON_HUMAN | 0.00% | 0.00% | 0.00% | 1 |
| NORP | 99.46% | 99.18% | 99.32% | 368 |
| ORDINAL | 96.63% | 97.64% | 97.14% | 382 |
| ORGANISATION | 90.97% | 91.23% | 91.10% | 718 |
| PERCENTAGE | 98.05% | 98.05% | 98.05% | 462 |
| PERSON | 98.70% | 99.13% | 98.92% | 1,151 |
| POSITION | 96.36% | 97.65% | 97.00% | 597 |
| PRODUCT | 89.23% | 77.33% | 82.86% | 75 |
| PROJECT | 93.69% | 93.69% | 93.69% | 206 |
| QUANTITY | 97.26% | 97.02% | 97.14% | 403 |
| TIME | 94.95% | 94.09% | 94.52% | 220 |
| micro avg | 96.54% | 96.85% | 96.69% | 13,072 |
| macro avg | 88.88% | 87.11% | 87.55% | 13,072 |
| weighted avg | 96.55% | 96.85% | 96.67% | 13,072 |