BERT5urk

This repository hosts BERT5urk, a new Turkish T5 model with 1.42B parameters.

BERT5urk is part of the Turkish Model Zoo family and was pretrained using the awesome T5X library with the UL2 objective.

Inspired by the great Finnish T5 and UL2 models from the Finnish NLP group, BERT5urk also uses UL2 and the efficient T5 architecture proposed in the "Scale Efficiently" paper. Many thanks to the Finnish NLP group for open-sourcing the pretraining code and models!
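
UL2 mixes several denoising objectives (regular span corruption, prefix language modeling, and extreme span corruption), each marked with a mode token. Purely as a self-contained illustration of the span-corruption part, here is a small sketch; the sentinel format and corruption rate are illustrative and not the exact pretraining settings:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, max_span_length=3, seed=42):
    """Replace random spans with T5-style sentinel tokens; the masked-out
    spans become the target sequence the decoder has to reconstruct."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * corruption_rate))  # tokens to mask
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if budget > 0 and rng.random() < corruption_rate:
            span = min(rng.randint(1, max_span_length), len(tokens) - i, budget)
            marker = f"<extra_id_{sentinel}>"
            inputs.append(marker)               # sentinel replaces the span
            targets.append(marker)              # target lists the span content
            targets.extend(tokens[i:i + span])
            i += span
            budget -= span
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

tokens = "bu repo yeni bir Türkçe T5 modelini barındırıyor".split()
inputs, targets = span_corrupt(tokens)
print(" ".join(inputs))
print(" ".join(targets))
```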

Pretraining Data

BERT5urk uses the Turkish part of the amazing FineWeb2 corpus. Only documents with a language score higher than 0.99 are selected for the final pretraining corpus, which has a total size of 262GB.
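
A minimal sketch of how such a filter could look with the datasets library; the subset name ("tur_Latn") and the "language_score" column are assumptions about the dataset layout, not the exact preprocessing code:

```python
from datasets import load_dataset

# Stream the Turkish subset of FineWeb2 (subset name is an assumption).
ds = load_dataset(
    "HuggingFaceFW/fineweb-2", name="tur_Latn", split="train", streaming=True
)

# Keep only documents whose language identification score exceeds 0.99.
filtered = ds.filter(lambda doc: doc["language_score"] > 0.99)

# Inspect a few surviving documents.
for doc in filtered.take(3):
    print(doc["text"][:80])
```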

We train an SPM-based vocabulary on a 3GB corpus of randomly chosen documents from the pretraining corpus.
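
A sketch of how such a SentencePiece vocabulary could be trained on that sample; the file path, vocabulary size, and model type here are illustrative assumptions rather than the exact settings:

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on a plain-text sample of the
# pretraining corpus (one document per line). Path and vocab size are
# placeholders, not the actual configuration.
spm.SentencePieceTrainer.train(
    input="fineweb2_tr_sample.txt",
    model_prefix="bert5urk_spm",
    vocab_size=32000,
    model_type="unigram",
    input_sentence_size=10_000_000,
    shuffle_input_sentence=True,
)

# Quick sanity check of the trained vocabulary.
sp = spm.SentencePieceProcessor(model_file="bert5urk_spm.model")
print(sp.encode("Bugün hava çok güzel.", out_type=str))
```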

Pretraining

BERT5urk was pretrained with the awesome T5X library. Some pretraining highlights:

  • One-shot pretraining (pretraining without any training crashes) was possible on a v3-32 TPU Pod and took 16.56 days
  • The model was pretrained for 2M steps with an input and output sequence length of 512 and a batch size of 128
  • The resulting model has 1.42B parameters
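
Taken together, these settings correspond to roughly 131B input tokens seen over the course of pretraining:

```python
steps, batch_size, seq_len = 2_000_000, 128, 512

# Rough estimate of input tokens processed during pretraining
# (ignores padding and the separate target sequences).
tokens_seen = steps * batch_size * seq_len
print(f"{tokens_seen / 1e9:.0f}B tokens")  # -> 131B tokens
```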

Evaluation

Detailed evaluations can be found in the Turkish Model Zoo repository. Additionally, we also fine-tuned TURNA, another Turkish T5 model with 1.14B parameters, for comparison.

Encoder-only Results

For experiments on named entity recognition (NER) and part-of-speech (PoS) tagging we use the awesome Flair library and fine-tune only the encoder of BERT5urk and TURNA. The overall performance can be seen in the following table:

| Model Name                 | Overall Development | Overall Test |
|----------------------------|---------------------|--------------|
| BERTurk (cased, 128k)      | 89.72               | 90.05        |
| BERTurk (uncased, 128k)    | 89.25               | 89.95        |
| BERTurk (cased, 32k)       | 88.98               | 89.49        |
| BERTurk (uncased, 32k)     | 89.28               | 89.67        |
| ConvBERTurk (cased)        | 90.06               | 90.27        |
| ConvBERTurk mC4 (cased)    | 90.03               | 90.09        |
| ConvBERTurk mC4 (uncased)  | 89.76               | 89.97        |
| DistilBERTurk (cased)      | 87.95               | 88.16        |
| ELECTRA Base (cased)       | 89.08               | 89.91        |
| ELECTRA Base mC4 (cased)   | 89.24               | 90.03        |
| ELECTRA Base mC4 (uncased) | 89.09               | 89.62        |
| ELECTRA Small (cased)      | 87.27               | 88.28        |
| BERT5urk                   | 89.96               | 90.26        |
| TURNA                      | 88.81               | 89.36        |
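
For orientation, a minimal Flair fine-tuning sketch for the NER setup is shown below. The corpus path, column format, and hyperparameters are illustrative assumptions, and whether Flair extracts only the encoder of a T5-style checkpoint out of the box depends on the Flair version:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical NER corpus in CoNLL column format (token + tag per line);
# path and column mapping are placeholders, not the actual experiment data.
corpus = ColumnCorpus("data/turkish-ner", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

# Load the checkpoint as word embeddings and fine-tune it end-to-end.
embeddings = TransformerWordEmbeddings(
    "stefan-it/bert5urk",
    layers="-1",
    fine_tune=True,
)

# Plain linear tagging head on top of the transformer, no CRF or RNN.
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# Fine-tune with a small learning rate, as usual for transformer backbones.
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/bert5urk-ner",
    learning_rate=5e-5,
    mini_batch_size=16,
    max_epochs=10,
)
```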

Encoder-decoder Results

We tried to replicate the results from the TURNA paper using the TURNA fine-tuning library.
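
The actual fine-tuning code lives in the TURNA fine-tuning library. Purely as an illustration of the general conditional-generation setup (not that library's API), a plain transformers Seq2SeqTrainer sketch could look like this; the data files, column names, and hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "stefan-it/bert5urk"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical paraphrase pairs with "source" and "target" columns.
dataset = load_dataset("json", data_files={"train": "tatoeba_train.json",
                                           "test": "tatoeba_test.json"})

def preprocess(batch):
    # Tokenize source sentences as inputs and target paraphrases as labels.
    inputs = tokenizer(batch["source"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bert5urk-tatoeba",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=10,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```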

Paraphrasing - Tatoeba

We fine-tune five models each for TURNA and BERT5urk with different seeds and report the average scores. Additionally, the scores from the TURNA paper are shown in the following table:

| Model              | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|--------------------|-------------|-------------|-------------|-----------|-------------|
| TURNA (paper)      | 90.22       | 80.23       | 88.95       | 71.14     | 87.56       |
| TURNA (replicated) | 90.36       | 80.50       | 89.10       | 71.48     | 87.63       |
| BERT5urk           | 90.47       | 80.78       | 89.21       | 71.89     | 87.74       |
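
The reported columns correspond to ROUGE-1/2/L, BLEU, and METEOR, which can be computed for example with the Hugging Face evaluate library. A generic sketch with placeholder predictions, not the exact evaluation code:

```python
import evaluate

# Placeholder model outputs and gold paraphrases.
predictions = ["bugün hava çok güzel"]
references = ["bugün hava gerçekten çok güzel"]

rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))
```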

Paraphrasing - OpenSubtitles

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores (including the scores from the TURNA paper):

| Model              | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|--------------------|-------------|-------------|-------------|-----------|-------------|
| TURNA (paper)      | 78.43       | 63.58       | 76.81       | 51.47     | 74.79       |
| TURNA (replicated) | 78.36       | 63.42       | 76.71       | 51.39     | 74.94       |
| BERT5urk           | 78.56       | 63.80       | 76.95       | 51.74     | 75.07       |

Title Generation - TrNews

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores (including the scores from the TURNA paper):

| Model              | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|--------------------|-------------|-------------|-------------|-----------|-------------|
| TURNA (paper)      | 36.47       | 22.88       | 35.47       | 12.64     | 23.62       |
| TURNA (replicated) | 41.65       | 27.60       | 36.77       | 18.60     | 34.55       |
| BERT5urk           | 41.79       | 27.77       | 37.00       | 19.08     | 34.69       |

Summarization - TrNews

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores (including the scores from the TURNA paper):

| Model              | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|--------------------|-------------|-------------|-------------|-----------|-------------|
| TURNA (paper)      | 41.77       | 27.81       | 36.99       | 19.05     | 34.61       |
| TURNA (replicated) | 40.75       | 26.82       | 35.88       | 18.00     | 33.91       |
| BERT5urk           | 41.00       | 27.08       | 36.24       | 18.78     | 23.96       |

Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs over many years ❤️

Made from Bavarian Oberland with ❤️ and 🥨.
