## Miscellaneous remarks

* Use loss regularization if you train with `bfloat16` (more info below).
* Beware of the dropout rate in the config.json file. Check in a model's `config.json` what the dropout rate has been set to. Unless you intend to run many epochs on the same data, it is worth trying a training run without dropout. If you want to compare losses between runs, be sure to set the dropout rate equal. The smaller models can probably always be trained without dropout. (A short sketch of inspecting and overriding the dropout rate follows this list.)
* For the translation task, I am not sure that a 'deep-narrow' model (e.g. base-nl36) is better than a normal model or even a 'wide-deep' model.
* Training with more layers is much slower than you'd expect from the increased model size. It is also more difficult to get the batch size and learning rate right. Below is a section about finding the right hyperparameters for the base-36L training.
* The 'larger' models are not only harder to pre-train but also harder to fine-tune. The optimizer state takes up a lot of memory, and the amount of memory required also depends on the length of the source and target sequences.
* PyCharm's remote debugging features are useful for inspecting variables on either a TPU VM or your deep-learning rig.
* When increasing the batch size, also increase the learning rate. bs * 2 -> lr * sqrt(2) is a good heuristic, but mileage may vary. (A small scaling sketch follows this list.)
* Translation evaluation: the low score of the 128 seq-len models on opus books may be caused by the brevity penalty, since books may have sentences longer than 128 tokens. (See the brevity-penalty sketch after this list.)
* Dataset quality is a key success factor. Do not expect a model to magically turn mediocre data into something great. This holds for the pre-training data, the fine-tuning data, and also the evaluation data.
* A good BLEU score does not necessarily mean fluent text. Evaluation loss on a large translation dataset might be better suited for model comparison.
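
For the dropout remark above, here is a minimal sketch of inspecting and overriding the dropout rate in a checkpoint's `config.json` via the Hugging Face `transformers` API. The checkpoint name "t5-base" and the Flax model class are only examples, not necessarily what is used in this project.

```python
from transformers import T5Config, FlaxT5ForConditionalGeneration

# Inspect the dropout rate stored in the checkpoint's config.json
# ("t5-base" is only an example checkpoint name).
config = T5Config.from_pretrained("t5-base")
print(config.dropout_rate)  # e.g. 0.1

# Disable dropout, e.g. for a run that sees the data only once,
# or to make losses comparable across runs.
config.dropout_rate = 0.0
model = FlaxT5ForConditionalGeneration.from_pretrained("t5-base", config=config)
```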
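
For the batch-size / learning-rate remark, a tiny sketch of the square-root scaling heuristic; the function name and the example numbers are made up for illustration.

```python
import math

def scale_learning_rate(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Square-root heuristic: doubling the batch size multiplies
    the learning rate by sqrt(2)."""
    return base_lr * math.sqrt(new_bs / base_bs)

# Example: doubling the batch size from 32 to 64 at a base LR of 5e-3.
print(scale_learning_rate(5e-3, 32, 64))  # ~7.1e-3
```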
|
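For the translation-evaluation remark, a small sketch of the standard BLEU brevity penalty (Papineni et al., 2002), which illustrates how a candidate truncated to 128 tokens loses score against a longer reference; the example lengths are illustrative only.

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BLEU brevity penalty: 1.0 when the candidate is at least as long
    as the reference, exp(1 - r/c) otherwise."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

# A candidate capped at 128 tokens scored against a 160-token reference
# already forfeits about 22% of its BLEU score to the brevity penalty.
print(brevity_penalty(128, 160))  # ~0.78
```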