
## Evaluation

### Running evaluation runs

Each pre-trained model was evaluated by fine-tuning on summarization and translation. The learning rate followed
a constant schedule after a small warmup of 32 steps.
Fine-tuning for evaluation was done on a limited set of 50K examples from the fine-tuning datasets.

|                 | Summarization    | Translation       |
|----------------:|------------------|-------------------|
| Dataset         | CNN Dailymail NL | CCMatrix en -> nl |
| #train samples  | 50K              | 50K               |
| Optimizer       | AdamW            | AdamW             |
| Learning rate   | 0.001            | 0.0005            |
| Source length   | 1024             | 128               |
| Target length   | 142              | 128               |
| #eval samples   | 1000             | 1000              |
| Wandb link      | [eval_summ](https://wandb.ai/yepster/eval_dutch_cnndaily_202302_flax) | [eval_transl](https://wandb.ai/yepster/eval_dutch_ccmatrix_202302_flax) |
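
The optimizer for these runs can be sketched with `optax` roughly as follows. This is an illustration of the settings above, not the exact training script; the warmup shape is assumed to be linear and weight decay is left at the `optax` default, since neither is specified here:

```python
import optax

def make_optimizer(learning_rate: float, warmup_steps: int = 32) -> optax.GradientTransformation:
    """AdamW with a short warmup followed by a constant learning rate."""
    schedule = optax.join_schedules(
        schedules=[
            # Assumed: warmup ramps linearly from 0 up to the target learning rate.
            optax.linear_schedule(init_value=0.0, end_value=learning_rate,
                                  transition_steps=warmup_steps),
            # Constant schedule after the warmup, as described above.
            optax.constant_schedule(learning_rate),
        ],
        boundaries=[warmup_steps],
    )
    return optax.adamw(learning_rate=schedule)

summarization_tx = make_optimizer(0.001)   # CNN Dailymail NL
translation_tx = make_optimizer(0.0005)    # CCMatrix en -> nl
```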

The graph below shows the Rouge1 score for the summarization runs, evaluated
after 25K and 50K examples on the [CNN Dailymail Dutch](https://huggingface.co/datasets/yhavinga/cnn_dailymail_dutch) dataset:



* Flan models perform well on the summarization task almost immediately, with `flan-t5-small`
showing performance comparable to Dutch T5 base models.
* After 50K examples, the `ul2` models exhibit performance similar to the `flan` models.
* I am surprised by the consistently poor scores of the `long-t5` runs. I retried the fine-tuning of these models with
`float32` instead of `bfloat16`, but the results were the same. This may be normal behaviour for models
targeted at longer sequence lengths.

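For reference, a Rouge1 score like the one plotted above can be computed with the Hugging Face `evaluate` library along these lines. This is a minimal sketch assuming `predictions` and `references` are lists of decoded strings, not necessarily the exact evaluation code behind these runs:

```python
import evaluate

rouge = evaluate.load("rouge")

def rouge1_score(predictions, references):
    """Aggregated Rouge1 for generated summaries against reference summaries."""
    scores = rouge.compute(predictions=predictions, references=references)
    return scores["rouge1"]
```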

The graph below shows the Bleu score for the translation runs, evaluated at step 25K and
50K on the [CCMatrix](https://huggingface.co/datasets/yhavinga/ccmatrix_en_nl) dataset, from
English to Dutch:



* For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. The
`ul2` pre-trained models are also consistently better than their `Flan`, `T5 Dutch` and
`mT5` counterparts.
* As with the summarization task, the `long-t5` models perform poorly, even after 50K examples. I
cannot explain this at all for the translation task: with a sequence length of 128 input and output
tokens, the sliding attention window with radius 127 of the `long-t5` models should be able to handle this.

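The Bleu scores above can be reproduced in spirit with the `sacrebleu` metric from the `evaluate` library. This sketch assumes one reference translation per prediction and is not necessarily identical to the evaluation code used for these runs:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

def bleu_score(predictions, references):
    """Corpus-level Bleu for translated sentences against single references."""
    result = sacrebleu.compute(
        predictions=predictions,
        references=[[ref] for ref in references],  # sacrebleu expects a list of references per prediction
    )
    return result["score"]
```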

The figure below shows the evaluation scores for most models, with summarization Rouge1 on the x-axis (higher is better),
and translation English to Dutch Bleu score on the y-axis (higher is better).
The point size is proportional to the model size. UL2 models are blue, Flan models
red, mT5 green and the other models black.



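A figure with this encoding can be produced along the following lines. This is a sketch, not the script that generated the image above; the record keys and the size scaling factor are assumptions, and no scores are hardcoded:

```python
import matplotlib.pyplot as plt

# Assumed record format: {"name": ..., "rouge1": ..., "bleu": ..., "params": ..., "family": ...}
FAMILY_COLORS = {"ul2": "blue", "flan": "red", "mt5": "green", "other": "black"}

def plot_scores(results, size_scale=2e-7):
    fig, ax = plt.subplots()
    for r in results:
        ax.scatter(
            r["rouge1"],                                # summarization score on the x-axis
            r["bleu"],                                  # translation score on the y-axis
            s=r["params"] * size_scale,                 # point size proportional to model size
            c=FAMILY_COLORS.get(r["family"], "black"),  # colour by model family
        )
        ax.annotate(r["name"], (r["rouge1"], r["bleu"]), fontsize=7)
    ax.set_xlabel("Summarization Rouge1 (higher is better)")
    ax.set_ylabel("Translation en->nl Bleu (higher is better)")
    return fig
```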
* For clarity, not all models are shown. `t5-base-36L-dutch-english-cased` is a model with
scores comparable to `ul2-large-dutch-english`, but with slower inference. All `long-t5`
runs are left out, as is the `t5-v1.1-large-dutch-cased` model, whose translation fine-tuning
diverged.
* Across the board, for translation the models pre-trained on Dutch+English or Dutch converge faster than the other models.
I was surprised to see `t5-xl-4l` among the best models on translation, as it has only 4 layers, and previous tests
showed very poor performance. (In those tests I had forgotten to force the dropout rate to 0.0, and
apparently this model is very sensitive to dropout; a sketch of forcing dropout off follows below.)

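As a hypothetical example (the model id below is a placeholder for the actual checkpoint path or hub id), dropout can be forced off by overriding the config value before loading the checkpoint:

```python
from transformers import AutoConfig, FlaxT5ForConditionalGeneration

model_id = "t5-xl-4l"  # placeholder: actual checkpoint path or hub id of the model discussed above
config = AutoConfig.from_pretrained(model_id)
config.dropout_rate = 0.0  # force dropout off for fine-tuning
model = FlaxT5ForConditionalGeneration.from_pretrained(model_id, config=config)
```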