BERT5urk

This repository hosts BERT5urk, a new Turkish T5 model with 1.42B parameters.

BERT5urk is part of the Turkish Model Zoo family and was pretrained using the awesome T5X library with the UL2 objective.

Inspired by the great Finnish T5 and UL2 models from the Finnish NLP group, BERT5urk also uses UL2 and the efficient T5 architecture proposed in the "Scale Efficiently" paper. Many thanks to the Finnish NLP group for open-sourcing the pretraining code and models!
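
UL2 mixes several denoising objectives (regular span corruption, prefix language modeling, and extreme span corruption), each marked with a mode token. Purely as a self-contained illustration of the span-corruption part, here is a small sketch; the sentinel format and corruption rate are illustrative and not the exact pretraining settings:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, max_span_length=3, seed=42):
    """Replace random spans with T5-style sentinel tokens; the masked-out
    spans become the target sequence the decoder has to reconstruct."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * corruption_rate))  # tokens to mask
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if budget > 0 and rng.random() < corruption_rate:
            span = min(rng.randint(1, max_span_length), len(tokens) - i, budget)
            marker = f"<extra_id_{sentinel}>"
            inputs.append(marker)               # sentinel replaces the span
            targets.append(marker)              # target lists the span content
            targets.extend(tokens[i:i + span])
            i += span
            budget -= span
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

tokens = "bu repo yeni bir Türkçe T5 modelini barındırıyor".split()
inputs, targets = span_corrupt(tokens)
print(" ".join(inputs))
print(" ".join(targets))
```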

Pretraining Data

BERT5urk uses the Turkish part of the amazing FineWeb2 corpus. Only documents with a language score higher than 0.99 are selected for the final pretraining corpus, which has a total size of 262GB.
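
A minimal sketch of how such a filter could look with the datasets library; the subset name ("tur_Latn") and the "language_score" column are assumptions about the dataset layout, not the exact preprocessing code:

```python
from datasets import load_dataset

# Stream the Turkish subset of FineWeb2 (subset name is an assumption).
ds = load_dataset(
    "HuggingFaceFW/fineweb-2", name="tur_Latn", split="train", streaming=True
)

# Keep only documents whose language identification score exceeds 0.99.
filtered = ds.filter(lambda doc: doc["language_score"] > 0.99)

# Inspect a few surviving documents.
for doc in filtered.take(3):
    print(doc["text"][:80])
```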

We train an SPM-based vocabulary on a 3GB corpus of randomly chosen documents from the pretraining corpus.
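
A sketch of how such a SentencePiece vocabulary could be trained on that sample; the file path, vocabulary size, and model type here are illustrative assumptions rather than the exact settings:

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on a plain-text sample of the
# pretraining corpus (one document per line). Path and vocab size are
# placeholders, not the actual configuration.
spm.SentencePieceTrainer.train(
    input="fineweb2_tr_sample.txt",
    model_prefix="bert5urk_spm",
    vocab_size=32000,
    model_type="unigram",
    input_sentence_size=10_000_000,
    shuffle_input_sentence=True,
)

# Quick sanity check of the trained vocabulary.
sp = spm.SentencePieceProcessor(model_file="bert5urk_spm.model")
print(sp.encode("Bugün hava çok güzel.", out_type=str))
```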

Pretraining

BERT5urk was pretrained with the awesome T5X library. Some pretraining highlights:

  • One-shot pretraining (pretraining without any training crashes) was possible on a v3-32 TPU Pod and took 16.56 days
  • The model was pretrained for 2M steps with an input and output sequence length of 512 and a batch size of 128
  • The resulting model has 1.42B parameters
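
Taken together, these settings correspond to roughly 131B input tokens seen over the course of pretraining:

```python
steps, batch_size, seq_len = 2_000_000, 128, 512

# Rough estimate of input tokens processed during pretraining
# (ignores padding and the separate target sequences).
tokens_seen = steps * batch_size * seq_len
print(f"{tokens_seen / 1e9:.0f}B tokens")  # -> 131B tokens
```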

Evaluation

Detailed evaluations can be found in the Turkish Model Zoo repository. Additionally, we also fine-tuned TURNA, another Turkish T5 model with 1.14B parameters, for comparison.

Encoder-only Results

For experiments on named entity recognition (NER) and part-of-speech (PoS) tagging we use the awesome Flair library and fine-tune only the encoder of BERT5urk and TURNA. The overall performance can be seen in the following table:

| Model Name                 | Overall Development | Overall Test |
|----------------------------|---------------------|--------------|
| BERTurk (cased, 128k)      | 89.72               | 90.05        |
| BERTurk (uncased, 128k)    | 89.25               | 89.95        |
| BERTurk (cased, 32k)       | 88.98               | 89.49        |
| BERTurk (uncased, 32k)     | 89.28               | 89.67        |
| ConvBERTurk (cased)        | 90.06               | 90.27        |
| ConvBERTurk mC4 (cased)    | 90.03               | 90.09        |
| ConvBERTurk mC4 (uncased)  | 89.76               | 89.97        |
| DistilBERTurk (cased)      | 87.95               | 88.16        |
| ELECTRA Base (cased)       | 89.08               | 89.91        |
| ELECTRA Base mC4 (cased)   | 89.24               | 90.03        |
| ELECTRA Base mC4 (uncased) | 89.09               | 89.62        |
| ELECTRA Small (cased)      | 87.27               | 88.28        |
| BERT5urk                   | 89.96               | 90.26        |
| TURNA                      | 88.81               | 89.36        |
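
For orientation, a minimal Flair fine-tuning sketch for the NER setup is shown below. The corpus path, column format, and hyperparameters are illustrative assumptions, and whether Flair extracts only the encoder of a T5-style checkpoint out of the box depends on the Flair version:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical NER corpus in CoNLL column format (token + tag per line);
# path and column mapping are placeholders, not the actual experiment data.
corpus = ColumnCorpus("data/turkish-ner", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

# Load the checkpoint as word embeddings and fine-tune it end-to-end.
embeddings = TransformerWordEmbeddings(
    "stefan-it/bert5urk",
    layers="-1",
    fine_tune=True,
)

# Plain linear tagging head on top of the transformer, no CRF or RNN.
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# Fine-tune with a small learning rate, as usual for transformer backbones.
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/bert5urk-ner",
    learning_rate=5e-5,
    mini_batch_size=16,
    max_epochs=10,
)
```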

Encoder-decoder Results

We tried to replicate the results from the TURNA paper using the TURNA fine-tuning library.
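
The actual fine-tuning code lives in the TURNA fine-tuning library. Purely as an illustration of the general conditional-generation setup (not that library's API), a plain transformers Seq2SeqTrainer sketch could look like this; the data files, column names, and hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "stefan-it/bert5urk"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical paraphrase pairs with "source" and "target" columns.
dataset = load_dataset("json", data_files={"train": "tatoeba_train.json",
                                           "test": "tatoeba_test.json"})

def preprocess(batch):
    # Tokenize source sentences as inputs and target paraphrases as labels.
    inputs = tokenizer(batch["source"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bert5urk-tatoeba",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=10,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```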

Paraphrasing - Tatoeba

We fine-tune five models each for TURNA and BERT5urk with different seeds and report the average scores. Additionally, the scores from the TURNA paper are shown in the following table:

| Model              | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|--------------------|-------------|-------------|-------------|-----------|-------------|
| TURNA (paper)      | 90.22       | 80.23       | 88.95       | 71.14     | 87.56       |
| TURNA (replicated) | 90.36       | 80.50       | 89.10       | 71.48     | 87.63       |
| BERT5urk           | 90.47       | 80.78       | 89.21       | 71.89     | 87.74       |
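
The reported columns correspond to ROUGE-1/2/L, BLEU, and METEOR, which can be computed for example with the Hugging Face evaluate library. A generic sketch with placeholder predictions, not the exact evaluation code:

```python
import evaluate

# Placeholder model outputs and gold paraphrases.
predictions = ["bugün hava çok güzel"]
references = ["bugün hava gerçekten çok güzel"]

rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))
```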

Paraphrasing - OpenSubtitles

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores (including the scores from the TURNA paper):

| Model              | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|--------------------|-------------|-------------|-------------|-----------|-------------|
| TURNA (paper)      | 78.43       | 63.58       | 76.81       | 51.47     | 74.79       |
| TURNA (replicated) | 78.36       | 63.42       | 76.71       | 51.39     | 74.94       |
| BERT5urk           | 78.56       | 63.80       | 76.95       | 51.74     | 75.07       |

Title Generation - TrNews

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores (including the scores from the TURNA paper):

| Model              | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|--------------------|-------------|-------------|-------------|-----------|-------------|
| TURNA (paper)      | 36.47       | 22.88       | 35.47       | 12.64     | 23.62       |
| TURNA (replicated) | 41.65       | 27.60       | 36.77       | 18.60     | 34.55       |
| BERT5urk           | 41.79       | 27.77       | 37.00       | 19.08     | 34.69       |

Summarization - TrNews

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores (including the scores from the TURNA paper):

| Model              | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|--------------------|-------------|-------------|-------------|-----------|-------------|
| TURNA (paper)      | 41.77       | 27.81       | 36.99       | 19.05     | 34.61       |
| TURNA (replicated) | 40.75       | 26.82       | 35.88       | 18.00     | 33.91       |
| BERT5urk           | 41.00       | 27.08       | 36.24       | 18.78     | 23.96       |

Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs over many years ❤️

Made from Bavarian Oberland with ❤️ and 🥨.
