TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
Abstract
TurBLiMP evaluates the linguistic abilities of language models on 16 Turkish phenomena with 1000 minimal pairs each, highlighting the challenges that Turkish syntax, word order, and morphology pose for LMs.
We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.
Community
Hi,
I've just tried out the evaluation code from the official repo and extended it to perform an overall evaluation of my current Turkish Language Models.
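For context, the scoring is the standard minimal-pair setup: a model gets a pair right if it assigns a higher score to the grammatical sentence. Here is a minimal sketch of one way to score a sentence with a masked LM, via pseudo-log-likelihood (Salazar et al., 2020); the official repo's implementation may differ in details, and the example pair below is illustrative, not taken from TurBLiMP:

```python
# Minimal-pair scoring sketch for a masked LM, using pseudo-log-likelihood
# (Salazar et al., 2020). The official TurBLiMP evaluation code may differ
# in details; the sentence pair below is illustrative, not from the benchmark.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "dbmdz/bert-base-turkish-128k-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

def pll_score(sentence: str) -> float:
    """Sum of each token's log-probability when that token is masked."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        total += torch.log_softmax(logits[0, i], dim=-1)[input_ids[i]].item()
    return total

# Subject agreement example: "gidiyorum" (1sg) agrees with "Ben" (I),
# "gidiyor" (3sg) does not. The pair counts as correct if this prints True.
print(pll_score("Ben eve gidiyorum.") > pll_score("Ben eve gidiyor."))
```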
Here are the results:
Phenomenon | dbmdz/electra-small-turkish-cased-generator | dbmdz/electra-base-turkish-cased-generator | dbmdz/electra-base-turkish-mc4-cased-generator | dbmdz/electra-base-turkish-mc4-uncased-generator | dbmdz/bert-base-turkish-cased | dbmdz/bert-base-turkish-uncased | dbmdz/bert-base-turkish-128k-cased | dbmdz/bert-base-turkish-128k-uncased | dbmdz/distilbert-base-turkish-cased | dbmdz/convbert-base-turkish-cased | dbmdz/convbert-base-turkish-mc4-cased | dbmdz/convbert-base-turkish-mc4-uncased |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Anaphor Agreement | 74.1 | 94.3 | 94.3 | 92.8 | 96.7 | 97.3 | 97.3 | 97.7 | 96.9 | 58.1 | 44.3 | 44.6 |
Argument Structure (Transitive) | 86.6 | 99.6 | 99.4 | 98.7 | 99.7 | 99.6 | 99.8 | 99.1 | 97.5 | 51.9 | 58.1 | 51.3 |
Argument Structure (Ditransitive) | 79.3 | 96.1 | 95.5 | 95.2 | 99.8 | 96.1 | 96.1 | 96.1 | 95.4 | 64.6 | 58.6 | 64.5 |
Binding | 70.7 | 96.2 | 91.4 | 89.6 | 99.9 | 98.5 | 97.7 | 99 | 93 | 89.1 | 49.4 | 78.4 |
Determiners | 91.8 | 99.3 | 98.2 | 99.1 | 99.9 | 100 | 99 | 99.3 | 82.9 | 0 | 0 | 0 |
Ellipsis | 10.6 | 49.7 | 46.3 | 49 | 87.4 | 73.6 | 96.6 | 87.5 | 13.6 | 54.7 | 57.8 | 67.9 |
Irregular Forms | 98.7 | 97.9 | 99 | 99.8 | 98.8 | 100 | 99.9 | 99.6 | 94.1 | 82.9 | 86.6 | 95.2 |
Island Effects | 39.1 | 35.3 | 41.8 | 44 | 49.4 | 39.8 | 60.9 | 51.2 | 47.4 | 96.7 | 99.4 | 100 |
Nominalization | 90 | 96.6 | 97 | 95.4 | 97.4 | 97 | 98.9 | 97.4 | 95.6 | 55.2 | 59.2 | 60.6 |
NPI Licensing | 90.9 | 96.1 | 95 | 98 | 98.2 | 97.6 | 97.2 | 95 | 92.1 | 82.1 | 95.6 | 71.9 |
Passives | 100 | 91.2 | 93.6 | 91.6 | 82.2 | 78.1 | 84.4 | 81.3 | 98.8 | 100 | 100 | 99 |
Quantifiers | 97.9 | 98 | 98 | 97.6 | 95.7 | 94.6 | 98 | 98.4 | 98.4 | 99 | 99 | 99 |
Relative Clauses | 79.9 | 90.7 | 92 | 91.6 | 97.7 | 97.5 | 97 | 98.5 | 92 | 53.4 | 53.7 | 56.9 |
Scrambling | 99.5 | 100 | 100 | 99.8 | 100 | 100 | 99.6 | 100 | 99.8 | 38.7 | 59.3 | 63.3 |
Subject Agreement | 82.8 | 99 | 97.2 | 96.1 | 98.3 | 99.2 | 99.1 | 98.8 | 97 | 47.7 | 43.9 | 56.4 |
Suspended Affixation | 97.5 | 99 | 99.1 | 98.8 | 100 | 100 | 100 | 100 | 100 | 25.4 | 12.8 | 23.2 |
Model Average | 80.6 | 89.9 | 89.9 | 89.8 | 93.8 | 91.8 | 95.1 | 93.7 | 87.2 | 62.5 | 61.1 | 64.5 |
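(For reference, the Model Average row is just the unweighted mean of the 16 phenomenon accuracies; the overall evaluation I added boils down to something like the following, with the data layout assumed:)

```python
# Unweighted mean over phenomena; `results` is assumed to map each of the
# 16 phenomenon names to a pair-level accuracy in percent.
def model_average(results: dict[str, float]) -> float:
    return sum(results.values()) / len(results)

# e.g. the first column above: mean of 74.1, 86.6, ..., 97.5 -> 80.6
```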
So I highly recommend trying out the dbmdz/bert-base-turkish-128k-cased model as well; it achieves a new SOTA on this benchmark! I will release the model outputs and my evaluation code on the Model Hub soon :)
/cc @ezgibasar , @fpadovani , @jumelet and @arianna-bis
Hi,
Thank you for your work! It is super interesting to see the results for all these architectures. Good to know that the cased variant performs even better than the uncased one. We'll keep these results in mind moving forward!