
## Pre-training

### mC4 dataset

Together with the T5 model architecture and SeqIO, the T5 authors also created and released
the multilingual [mC4 dataset](https://huggingface.co/datasets/allenai/c4).
It was made available by AllenAI on the HuggingFace Dataset hub.
Our team confirmed that the Dutch portion of the mC4 dataset was deduplicated,
and we cleaned it using [code adapted](https://gitlab.com/yhavinga/c4nlpreproc) from the TensorFlow C4 dataset.
The resulting [mc4_nl_cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned) dataset on the HuggingFace hub
has configs for several sizes, as well as configs for mixed Dutch and English
texts, e.g. [micro_en_nl](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned/viewer/micro_en_nl/train).
The `_en_nl` configs were added to accommodate multi-language pre-training
with the Huggingface pre-training script, which accepts only a single dataset as input.
Cleaned English C4 is roughly 5 times larger than its Dutch counterpart, so
interleaving the datasets in a 1:1 ratio means discarding approximately 80% of the English data.
(When pre-training with T5X and SeqIO, it is possible to define task mixtures that include multiple datasets,
so these `_en_nl` configs are not needed.)
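
To get a feel for the data, the cleaned dataset and its configs can be loaded with the `datasets`
library. The snippet below is only an illustration and not part of the pre-training pipeline; the
config name comes from the dataset card linked above, while the split, the streaming setting, and
the `"text"` field access are assumptions about how you might inspect it:

```python
from datasets import load_dataset

# Load the small mixed Dutch/English config of the cleaned mC4 dataset.
# Streaming avoids downloading the full corpus up front.
dataset = load_dataset(
    "yhavinga/mc4_nl_cleaned",
    "micro_en_nl",
    split="train",
    streaming=True,
)

# Inspect the first record; each record is assumed to carry a "text" field
# with one cleaned web document.
first_example = next(iter(dataset))
print(first_example["text"][:200])
```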

The full, cleaned Dutch mC4 dataset is 151GB and remains (as of June 2022) the largest available Dutch
corpus on the HuggingFace Dataset hub.

### Additional books, Wikipedia and Dutch news articles datasets

The `t5_1_1` and `ul2` models have also been trained on Dutch books, the Dutch subset of Wikipedia (2022-03-20),
the English subset of Wikipedia (2022-03-01), and a subset of "mc4_nl_cleaned" containing only texts
from Dutch and Belgian newspapers. These datasets were mixed in to bias the model towards
descriptions of events in the Netherlands and Belgium.

### Pre-Training Objectives

The T5 models are pre-trained using the [span corruption](https://arxiv.org/abs/1910.10683) denoising objective:
15% of the tokens in the text are masked, and each span of consecutive masked tokens is replaced with a
special token known as a sentinel token, with each span assigned its own sentinel. The model is then
trained to predict, for each sentinel token, the original text that it replaced.
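
To make this concrete, here is a small worked example written as Python strings. It mirrors the
illustrative sentence from the T5 paper, with the sentinels named after the HuggingFace T5 tokenizer
convention (`<extra_id_N>`); it is an illustration only, not output of the actual pre-processing code:

```python
# Original sentence (the span-corruption illustration from the T5 paper).
original = "Thank you for inviting me to your party last week ."

# Two spans ("for inviting" and "last") are masked; each span is replaced
# by its own sentinel token in the encoder input.
model_input = "Thank you <extra_id_0> me to your party <extra_id_1> week ."

# The decoder target lists each sentinel followed by the tokens it replaced,
# terminated by one final sentinel.
target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```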

The UL2 models are pre-trained with the [Mixture-of-Denoisers (MoD)](https://arxiv.org/abs/2205.05131) objective, which combines diverse pre-training
paradigms. UL2 frames different objective functions for training language models as denoising tasks, where
the model has to recover missing sub-sequences of a given input. During pre-training it uses a mixture-of-denoisers
that samples from a varied set of such objectives, each with a different configuration. UL2 is trained using a mixture of
three denoising tasks:

1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
2. X-denoising (or extreme span corruption); and
3. S-denoising (or sequential PrefixLM).
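
The following is a minimal, hypothetical sketch of how such a mixture can be parameterised and sampled
per training example. The span lengths, corruption rates, and sampling weights are only indicative of
the UL2 paper's setup and are not the values used for the Dutch models:

```python
import random

# Hypothetical denoiser configurations in the spirit of UL2's Mixture-of-Denoisers.
# The numbers are indicative only, not the exact pre-training settings.
DENOISERS = [
    {"name": "R", "mean_span": 3,  "corruption_rate": 0.15},  # regular span corruption (T5-style)
    {"name": "X", "mean_span": 32, "corruption_rate": 0.50},  # extreme span corruption
    {"name": "S", "prefix_lm": True},                         # sequential PrefixLM denoising
]
WEIGHTS = [0.25, 0.25, 0.50]  # hypothetical sampling weights


def sample_denoiser(rng: random.Random) -> dict:
    """Pick one denoiser configuration for the next training example."""
    return rng.choices(DENOISERS, weights=WEIGHTS, k=1)[0]


rng = random.Random(0)
print(sample_denoiser(rng)["name"])
```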

### Pre-training software

#### Huggingface [run_t5_mlm_flax.py](https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py)

All models except `t5_1_1` and `ul2` were pre-trained using the Huggingface `run_t5_mlm_flax.py` script.
This script is a good fit if you want to get a grasp of what is needed to pre-train a language model
with Flax and Jax, since the data preparation, model instantiation, loss function, and training loop are all
contained in a single file.

#### Google's [T5X](https://github.com/google-research/t5x)

The Dutch `t5_1_1` and `ul2` models were pre-trained using T5X, a modular framework that can be used for
pre-training, fine-tuning, and evaluation of T5 models. Because of its modular and pluggable design,
it is possible to pre-train with your own definitions by supplying only a few configuration and code files.
It is even possible to define custom neural network layers and architectures, though I did not do this:
I pre-trained the default T5 encoder-decoder architecture and varied only the pre-training objective and the
datasets, which were defined and mixed with SeqIO.
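
For reference, a SeqIO task and mixture definition can look roughly like the sketch below. The task
name, the TFDS source, the vocabulary path, and the mixture rate are all hypothetical placeholders;
the actual definitions used for the Dutch models differ:

```python
import functools

import seqio
from t5.data import preprocessors as prep

# Hypothetical SentencePiece model path; the real models use their own vocabulary.
VOCAB = seqio.SentencePieceVocabulary("gs://my-bucket/vocab/nl_en_32000.model")

OUTPUT_FEATURES = {
    "inputs": seqio.Feature(vocabulary=VOCAB, add_eos=True),
    "targets": seqio.Feature(vocabulary=VOCAB, add_eos=True),
}

# Hypothetical task: T5 span corruption over a TFDS text dataset
# ("my_dutch_corpus" is a placeholder, not an existing TFDS dataset).
seqio.TaskRegistry.add(
    "dutch_span_corruption",
    source=seqio.TfdsDataSource(tfds_name="my_dutch_corpus:1.0.0"),
    preprocessors=[
        functools.partial(prep.rekey, key_map={"inputs": None, "targets": "text"}),
        seqio.preprocessors.tokenize,
        prep.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=OUTPUT_FEATURES,
    metric_fns=[],
)

# A mixture interleaves tasks at chosen rates; further tasks (for example an
# English one registered the same way) can be added with their own rates.
seqio.MixtureRegistry.add(
    "dutch_english_mixture",
    [("dutch_span_corruption", 1.0)],
)
```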

#### Conversion script from T5X to HF

The T5X models were converted to Huggingface Flax T5 format using a script that was adapted from the
[T5X checkpoint to HuggingFace Flax conversion script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_flax.py).
The script was modified to cast the weights to bfloat16 and to also export the model in PyTorch format.
For this conversion to be successful, the T5X model had to be saved with `use_gda=False` set in the GIN file.
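
As a rough sketch of those two modifications (not the actual modified script), the bfloat16 cast and
the PyTorch export can be done with the `transformers` Flax and PyTorch T5 classes; the checkpoint
directory below is a placeholder:

```python
from transformers import FlaxT5ForConditionalGeneration, T5ForConditionalGeneration

# Placeholder path to a checkpoint already converted to HuggingFace Flax format.
model_dir = "./t5x-converted-flax"

# Cast all Flax weights to bfloat16 and re-save the checkpoint.
flax_model = FlaxT5ForConditionalGeneration.from_pretrained(model_dir)
flax_model.params = flax_model.to_bf16(flax_model.params)
flax_model.save_pretrained(model_dir)

# Export a PyTorch checkpoint from the Flax weights.
pt_model = T5ForConditionalGeneration.from_pretrained(model_dir, from_flax=True)
pt_model.save_pretrained(model_dir)
```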