---
language:
- hr
license: cc-by-sa-4.0
---
# BERTic-Incorrect-Spelling-Annotator
This BERTic model is designed to annotate incorrectly spelled words in text. It uses the following labels:
- 0: the word is written correctly,
- 1: the word is written incorrectly.
## Model Output Example
Imagine we have the following Croatian text, which contains two intentionally misspelled words:

```
Model u tekstu prepoznije riječi u kojima se nalazaju pogreške .
```
If we convert the input data to the format expected by the BERTic model:

```
[CLS] model [MASK] u [MASK] tekstu [MASK] prepo ##znije [MASK] riječi [MASK] u [MASK] kojima [MASK] se [MASK] nalaza ##ju [MASK] pogreške [MASK] . [MASK] [SEP]
```
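The interleaving of `[MASK]` slots can be sketched as a simple preprocessing step. This is only an illustration: the real input is produced by the BERTic WordPiece tokenizer, which additionally splits unknown words into subword pieces (e.g. `prepo ##znije`).

```python
def to_masked_format(text: str) -> str:
    """Insert a [MASK] slot after every word and wrap with [CLS]/[SEP].

    Simplified sketch: lowercases and splits on whitespace only; the
    actual model input comes from the BERTic WordPiece tokenizer.
    """
    words = text.lower().split()
    body = " ".join(f"{w} [MASK]" for w in words)
    return f"[CLS] {body} [SEP]"
```

For example, `to_masked_format("Model u tekstu")` yields `[CLS] model [MASK] u [MASK] tekstu [MASK] [SEP]`.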
The model might return the following predictions (note: these predictions were chosen for demonstration, not reproducibility!):

```
Model 0 u 0 tekstu 0 prepoznije 1 riječi 0 u 0 kojima 0 se 0 nalazaju 1 pogreške 0 . 0
```
We can observe that the words prepoznije and nalazaju in the input sentence are spelled incorrectly, so the model marks them with label 1.
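Once the per-word labels are available, collecting the flagged words is straightforward. The helper below is a hypothetical sketch, not part of the released model code:

```python
def flag_misspelled(words, labels):
    # Keep the words whose predicted label is 1 (written incorrectly).
    return [w for w, lab in zip(words, labels) if lab == 1]

# The example sentence and the predictions shown above.
words = "model u tekstu prepoznije riječi u kojima se nalazaju pogreške .".split()
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
# flag_misspelled(words, labels) → ["prepoznije", "nalazaju"]
```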
## More details
Testing the model with generated test sets gives the following results:
- Precision: 0.9954
- Recall: 0.8764
- F1 Score: 0.9321
- F0.5 Score: 0.9691
Testing the model with test sets constructed from RAPUT 1.0, the Croatian corpus of non-professional written language produced by typical speakers and speakers with language disorders, gives the following results:
- Precision: 0.8213
- Recall: 0.3921
- F1 Score: 0.5308
- F0.5 Score: 0.6738
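For reference, the F-scores reported here follow from precision and recall via the standard F-beta formula; a quick illustrative check:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    # F-beta weights recall beta times as much as precision;
    # beta=1 gives F1, beta=0.5 favours precision (F0.5).
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# f_beta(0.8213, 0.3921, 1.0) ≈ 0.5308 and
# f_beta(0.8213, 0.3921, 0.5) ≈ 0.6738, matching the table above.
```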
## Acknowledgement
The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.
## Authors
Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing this model.