
## Evaluation

### Running evaluation runs

Each pre-trained model was evaluated by fine-tuning on summarization and translation. The learning rate followed
a constant schedule after a small warmup of 32 steps.
Fine-tuning for evaluation was done on a limited set of 50K examples from the fine-tuning datasets.

|                 | Summarization    | Translation       |
|----------------:|------------------|-------------------|
| Dataset         | CNN Dailymail NL | CCMatrix en -> nl |
| #train samples  | 50K              | 50K               |
| Optimizer       | AdamW            | AdamW             |
| Learning rate   | 0.001            | 0.0005            |
| Source length   | 1024             | 128               |
| Target length   | 142              | 128               |
| #eval samples   | 1000             | 1000              |
| Wandb link      | [eval_summ](https://wandb.ai/yepster/eval_dutch_cnndaily_202302_flax) | [eval_transl](https://wandb.ai/yepster/eval_dutch_ccmatrix_202302_flax) |
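
The optimizer for these runs can be sketched with `optax` roughly as follows. This is an illustration of the settings above, not the exact training script; the warmup shape is assumed to be linear and weight decay is left at the `optax` default, since neither is specified here:

```python
import optax

def make_optimizer(learning_rate: float, warmup_steps: int = 32) -> optax.GradientTransformation:
    """AdamW with a short warmup followed by a constant learning rate."""
    schedule = optax.join_schedules(
        schedules=[
            # Assumed: warmup ramps linearly from 0 up to the target learning rate.
            optax.linear_schedule(init_value=0.0, end_value=learning_rate,
                                  transition_steps=warmup_steps),
            # Constant schedule after the warmup, as described above.
            optax.constant_schedule(learning_rate),
        ],
        boundaries=[warmup_steps],
    )
    return optax.adamw(learning_rate=schedule)

summarization_tx = make_optimizer(0.001)   # CNN Dailymail NL
translation_tx = make_optimizer(0.0005)    # CCMatrix en -> nl
```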

The graph below shows the Rouge1 score for the summarization runs, evaluated
after 25K and 50K examples on the [CNN Dailymail Dutch](https://huggingface.co/datasets/yhavinga/cnn_dailymail_dutch) dataset:



* Flan models perform well on the summarization task almost immediately, with `flan-t5-small`
showing performance comparable to Dutch T5 base models.
* After 50K examples, the `ul2` models exhibit performance similar to the `flan` models.
* I am surprised by the consistently poor scores of the `long-t5` runs. I retried the fine-tuning of these models with
`float32` instead of `bfloat16`, but the results were the same. This may be normal behaviour for models
targeted at longer sequence lengths.

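For reference, a Rouge1 score like the one plotted above can be computed with the Hugging Face `evaluate` library along these lines. This is a minimal sketch assuming `predictions` and `references` are lists of decoded strings, not necessarily the exact evaluation code behind these runs:

```python
import evaluate

rouge = evaluate.load("rouge")

def rouge1_score(predictions, references):
    """Aggregated Rouge1 for generated summaries against reference summaries."""
    scores = rouge.compute(predictions=predictions, references=references)
    return scores["rouge1"]
```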

The graph below shows the Bleu score for the translation runs, evaluated at step 25K and
50K on the [CCMatrix](https://huggingface.co/datasets/yhavinga/ccmatrix_en_nl) dataset, from
English to Dutch:



* For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. The
`ul2` pre-trained models are also consistently better than their `Flan`, `T5 Dutch` and
`mT5` counterparts.
* As with the summarization task, the `long-t5` models perform poorly, even after 50K examples. I
cannot explain this at all for the translation task: with a sequence length of 128 input and output
tokens, the sliding attention window with radius 127 of the `long-t5` models should be able to handle this.

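The Bleu scores above can be reproduced in spirit with the `sacrebleu` metric from the `evaluate` library. This sketch assumes one reference translation per prediction and is not necessarily identical to the evaluation code used for these runs:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

def bleu_score(predictions, references):
    """Corpus-level Bleu for translated sentences against single references."""
    result = sacrebleu.compute(
        predictions=predictions,
        references=[[ref] for ref in references],  # sacrebleu expects a list of references per prediction
    )
    return result["score"]
```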

The figure below shows the evaluation scores for most models, with summarization Rouge1 on the x-axis (higher is better),
and translation English to Dutch Bleu score on the y-axis (higher is better).
The point size is proportional to the model size. UL2 models are blue, Flan models
red, mT5 green and the other models black.



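A figure with this encoding can be produced along the following lines. This is a sketch, not the script that generated the image above; the record keys and the size scaling factor are assumptions, and no scores are hardcoded:

```python
import matplotlib.pyplot as plt

# Assumed record format: {"name": ..., "rouge1": ..., "bleu": ..., "params": ..., "family": ...}
FAMILY_COLORS = {"ul2": "blue", "flan": "red", "mt5": "green", "other": "black"}

def plot_scores(results, size_scale=2e-7):
    fig, ax = plt.subplots()
    for r in results:
        ax.scatter(
            r["rouge1"],                                # summarization score on the x-axis
            r["bleu"],                                  # translation score on the y-axis
            s=r["params"] * size_scale,                 # point size proportional to model size
            c=FAMILY_COLORS.get(r["family"], "black"),  # colour by model family
        )
        ax.annotate(r["name"], (r["rouge1"], r["bleu"]), fontsize=7)
    ax.set_xlabel("Summarization Rouge1 (higher is better)")
    ax.set_ylabel("Translation en->nl Bleu (higher is better)")
    return fig
```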
* For clarity, not all models are shown. `t5-base-36L-dutch-english-cased` is a model with
scores comparable to `ul2-large-dutch-english`, but with slower inference. All `long-t5`
runs are left out, as is the `t5-v1.1-large-dutch-cased` model, whose translation fine-tuning
diverged.
* Across the board, for translation the models pre-trained on Dutch+English or Dutch converge faster than the other models.
I was surprised to see `t5-xl-4l` among the best models on translation, as it has only 4 layers, and previous tests
showed very poor performance. (In those tests I had forgotten to force the dropout rate to 0.0, and
apparently this model is very sensitive to dropout; a sketch of forcing dropout off follows below.)

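As a hypothetical example (the model id below is a placeholder for the actual checkpoint path or hub id), dropout can be forced off by overriding the config value before loading the checkpoint:

```python
from transformers import AutoConfig, FlaxT5ForConditionalGeneration

model_id = "t5-xl-4l"  # placeholder: actual checkpoint path or hub id of the model discussed above
config = AutoConfig.from_pretrained(model_id)
config.dropout_rate = 0.0  # force dropout off for fine-tuning
model = FlaxT5ForConditionalGeneration.from_pretrained(model_id, config=config)
```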