# EsperBERTo: A RoBERTa-like model for Esperanto
This is a RoBERTa-like model trained from scratch on the Esperanto language.
## Model description

The model has 6 layers, a hidden size of 768, 12 attention heads, and a total of roughly 84 million parameters. It is based on the RoBERTa architecture. The tokenizer is a byte-level Byte-Pair Encoding (BPE) tokenizer trained from scratch on the same Esperanto corpus; a configuration sketch follows the list below.
- Model: RoBERTa-like
- Layers: 6
- Hidden size: 768
- Heads: 12
- Parameters: 84M
- Tokenizer: Byte-level BPE
- Vocabulary size: 52,000
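The tokenizer and model setup can be sketched as follows. This is a minimal sketch assuming the `tokenizers` and `transformers` libraries and a local `oscar.eo.txt` file; settings not listed above (e.g. `min_frequency`, `max_position_embeddings`) are assumptions rather than values taken from the original run.

```python
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# Train a byte-level BPE tokenizer from scratch on the Esperanto corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar.eo.txt"],
    vocab_size=52_000,
    min_frequency=2,  # assumed; not stated in the card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./EsperBERTo")

# A small RoBERTa-like configuration: 6 layers, hidden size 768, 12 heads.
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,  # assumed; standard for RoBERTa-style models
    num_hidden_layers=6,
    hidden_size=768,
    num_attention_heads=12,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
print(f"{model.num_parameters():,} parameters")  # ~84M
```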
## Training data

The model was trained on the Esperanto portion of the OSCAR corpus (`oscar.eo.txt`), which is approximately 3 GB in size.
## Training procedure

The model was trained for one epoch on the OSCAR corpus using the `Trainer` API from the `transformers` library. Training was performed on a single GPU.
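A minimal sketch of the data preparation is shown below, assuming the tokenizer saved above and the `LineByLineTextDataset` and `DataCollatorForLanguageModeling` helpers from `transformers`; the block size and masking probability are assumptions, and the original notebook may prepare the data differently.

```python
from transformers import (
    RobertaTokenizerFast,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
)

# Load the byte-level BPE tokenizer trained earlier.
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", model_max_length=512)

# One training example per line of the raw corpus file.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="oscar.eo.txt",
    block_size=128,  # assumed
)

# Dynamic masking for the masked-language-modeling objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```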
### Hyperparameters

- `output_dir`: "./EsperBERTo"
- `overwrite_output_dir`: True
- `num_train_epochs`: 1
- `per_gpu_train_batch_size`: 64
- `save_steps`: 10_000
- `save_total_limit`: 2
- `prediction_loss_only`: True
The final training loss was 6.1178.
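Put together, the training loop would look roughly like the sketch below. The argument names mirror the hyperparameters listed above (note that `per_gpu_train_batch_size` has since been renamed `per_device_train_batch_size` in newer `transformers` releases); anything not listed there is an assumption.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,  # listed above as per_gpu_train_batch_size
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,                 # RobertaForMaskedLM defined earlier
    args=training_args,
    data_collator=data_collator, # MLM collator defined earlier
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("./EsperBERTo")
```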
## Evaluation results

The model was not evaluated on a downstream task in the notebook. However, its capabilities can be tested using the `fill-mask` pipeline.
Example 1:
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

fill_mask("La suno <mask>.")
```
Output:
```
[{'score': 0.013023526407778263, 'token': 316, 'token_str': ' estas', 'sequence': 'La suno estas.'},
 {'score': 0.008523152209818363, 'token': 607, 'token_str': ' min', 'sequence': 'La suno min.'},
 {'score': 0.007405377924442291, 'token': 2575, 'token_str': ' okuloj', 'sequence': 'La suno okuloj.'},
 {'score': 0.007219308987259865, 'token': 1635, 'token_str': ' tago', 'sequence': 'La suno tago.'},
 {'score': 0.006888304837048054, 'token': 394, 'token_str': ' estis', 'sequence': 'La suno estis.'}]
```
Example 2:
fill_mask("Jen la komenco de bela <mask>.")
Output:
```
[{'score': 0.016247423365712166, 'token': 1635, 'token_str': ' tago', 'sequence': 'Jen la komenco de bela tago.'},
 {'score': 0.009718689136207104, 'token': 1021, 'token_str': ' tempo', 'sequence': 'Jen la komenco de bela tempo.'},
 {'score': 0.007543196901679039, 'token': 2257, 'token_str': ' kongreso', 'sequence': 'Jen la komenco de bela kongreso.'},
 {'score': 0.0071307034231722355, 'token': 1161, 'token_str': ' vivo', 'sequence': 'Jen la komenco de bela vivo.'},
 {'score': 0.006644904613494873, 'token': 758, 'token_str': ' jaroj', 'sequence': 'Jen la komenco de bela jaroj.'}]
```
Intended uses & limitations
This model is intended to be a general-purpose language model for Esperanto. It can be used for masked language modeling and can be fine-tuned for various downstream tasks such as:
- Text Classification
- Token Classification (Part-of-Speech Tagging, Named Entity Recognition)
- Question Answering
Since the model was trained on a relatively small dataset, its performance may be limited. For better results on specific tasks, fine-tuning on a relevant dataset is recommended.
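As an illustration of such fine-tuning, here is a minimal sketch for a text-classification task. Only the model path comes from this card; the label count, training settings, and the tiny in-memory toy dataset are assumptions and stand in for a real labeled Esperanto dataset.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Reuse the pretrained EsperBERTo checkpoint as the encoder.
tokenizer = AutoTokenizer.from_pretrained("./EsperBERTo")
model = AutoModelForSequenceClassification.from_pretrained(
    "./EsperBERTo",
    num_labels=2,  # hypothetical binary classification task
)

# Toy in-memory dataset just to make the sketch runnable;
# replace with a real labeled Esperanto dataset ("text"/"label" columns).
train_dataset = Dataset.from_dict({
    "text": ["La suno brilas.", "Mi ne ŝatas pluvon."],
    "label": [1, 0],
})

def tokenize(batch):
    # Tokenize a batch of raw texts into fixed-length inputs.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

training_args = TrainingArguments(
    output_dir="./EsperBERTo-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset.map(tokenize, batched=True),
)
trainer.train()
```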