EsperBERTo: A RoBERTa-like model for Esperanto

This is a RoBERTa-like model trained from scratch on the Esperanto language.

Model description

The model has 6 layers, a hidden size of 768, 12 attention heads, and roughly 84 million parameters. It is based on the RoBERTa architecture. The tokenizer is a byte-level Byte-Pair Encoding (BPE) tokenizer trained from scratch on the same Esperanto corpus.

  • Model: RoBERTa-like
  • Layers: 6
  • Hidden size: 768
  • Heads: 12
  • Parameters: 84M
  • Tokenizer: Byte-level BPE
  • Vocabulary size: 52,000
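
For reference, a configuration matching the figures above might look like the sketch below. This is not the original notebook code: max_position_embeddings and type_vocab_size are assumed values typical of RoBERTa-style models.

from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,  # assumed; standard for RoBERTa-style models
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_size=768,
    type_vocab_size=1,            # assumed; RoBERTa uses a single segment type
)

model = RobertaForMaskedLM(config=config)
print(model.num_parameters())     # roughly 84 million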

Training data

The model was trained on the Esperanto portion of the OSCAR corpus (oscar.eo.txt), which is approximately 3GB in size.
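
A byte-level BPE tokenizer like the one described above can be trained on this file with the tokenizers library. The snippet below is a sketch only; it assumes oscar.eo.txt is in the working directory, a min_frequency cutoff of 2, and the standard RoBERTa special tokens.

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the raw Esperanto corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar.eo.txt"],
    vocab_size=52_000,
    min_frequency=2,  # assumed cutoff for rare merges
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt to the output directory.
tokenizer.save_model("./EsperBERTo")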

Training procedure

The model was trained for one epoch on the OSCAR corpus using the Trainer API from the transformers library. The training was performed on a single GPU.

Hyperparameters

  • output_dir: "./EsperBERTo"
  • overwrite_output_dir: True
  • num_train_epochs: 1
  • per_gpu_train_batch_size: 64
  • save_steps: 10_000
  • save_total_limit: 2
  • prediction_loss_only: True

The final training loss was 6.1178.
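
As a rough sketch, a comparable run could be set up with the Trainer API as follows. This is not the exact notebook code: the dataset class, block size, and max_position_embeddings are assumptions, and recent transformers releases expect per_device_train_batch_size rather than the per_gpu_train_batch_size listed above.

from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Tokenizer and model as described in "Model description".
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo")
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_hidden_layers=6,
    num_attention_heads=12,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

# Line-by-line dataset over the raw corpus, with dynamic MLM masking (15%).
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="oscar.eo.txt", block_size=128
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("./EsperBERTo")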

Evaluation results

The model was not evaluated on a downstream task in the training notebook. However, its masked-language-modeling behavior can be inspected with the fill-mask pipeline.

Example 1:

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

fill_mask("La suno <mask>.")

Output:

[{'score': 0.013023526407778263, 'token': 316, 'token_str': ' estas', 'sequence': 'La suno estas.'},
 {'score': 0.008523152209818363, 'token': 607, 'token_str': ' min', 'sequence': 'La suno min.'},
 {'score': 0.007405377924442291, 'token': 2575, 'token_str': ' okuloj', 'sequence': 'La suno okuloj.'},
 {'score': 0.007219308987259865, 'token': 1635, 'token_str': ' tago', 'sequence': 'La suno tago.'},
 {'score': 0.006888304837048054, 'token': 394, 'token_str': ' estis', 'sequence': 'La suno estis.'}]

Example 2:

fill_mask("Jen la komenco de bela <mask>.")

Output:

[{'score': 0.016247423365712166, 'token': 1635, 'token_str': ' tago', 'sequence': 'Jen la komenco de bela tago.'},
 {'score': 0.009718689136207104, 'token': 1021, 'token_str': ' tempo', 'sequence': 'Jen la komenco de bela tempo.'},
 {'score': 0.007543196901679039, 'token': 2257, 'token_str': ' kongreso', 'sequence': 'Jen la komenco de bela kongreso.'},
 {'score': 0.0071307034231722355, 'token': 1161, 'token_str': ' vivo', 'sequence': 'Jen la komenco de bela vivo.'},
 {'score': 0.006644904613494873, 'token': 758, 'token_str': ' jaroj', 'sequence': 'Jen la komenco de bela jaroj.'}]

Intended uses & limitations

This model is intended to be a general-purpose language model for Esperanto. It can be used for masked language modeling and can be fine-tuned for various downstream tasks such as:

  • Text Classification
  • Token Classification (Part-of-Speech Tagging, Named Entity Recognition)
  • Question Answering

Since the model was trained on a relatively small dataset, its performance may be limited. For better results on specific tasks, fine-tuning on a relevant dataset is recommended.
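
As an illustration of the token-classification case, the pretrained checkpoint can be loaded with a task-specific head before fine-tuning. The snippet below is only a sketch: the label set is hypothetical, and the labeled dataset and training loop are omitted.

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for an Esperanto POS-tagging task.
labels = ["NOUN", "VERB", "ADJ", "ADV", "OTHER"]

tokenizer = AutoTokenizer.from_pretrained("./EsperBERTo")

# Adds a randomly initialized token-classification head on top of the
# pretrained encoder; the head is learned during fine-tuning.
model = AutoModelForTokenClassification.from_pretrained(
    "./EsperBERTo",
    num_labels=len(labels),
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
)

Fine-tuning would then proceed with the Trainer API on a token-labeled dataset, analogous to the masked-language-modeling sketch in the Training procedure section.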
