BERTchen-v0.1

A MosaicBERT model efficiently pretrained on the German portion of CulturaX. We have released our paper and code.

Motivation

Encoder-only models perform well in a variety of tasks. However, their efficient pretraining and language adaptation remain underexplored. This study presents a method for training efficient, state-of-the-art German encoder-only models. Our research highlights the inefficiency of BERT models, in particular due to the plateau effect, and shows how architectural improvements such as the MosaicBERT architecture and curriculum learning approaches can combat it. We show the importance of an in-domain tokenizer and investigate different pretraining sequence lengths and datasets. BERTchen beats the previous best model, GottBERT, on GermanQuAD, increasing the F1 score from 73.06 to 95.1 and the exact match from 55.14 to 91.9. Our research provides a foundation for training efficient encoder-only models in different languages.

Model description

BERTchen follows the MosaicBERT architecture (introduced in the MosaicBERT paper) and uses FlashAttention 2. It was pretrained for 4 hours on a single A100 40GB GPU.

The tokenizer is taken from prior work on efficient German pretraining: paper and code.

Only the masked language modeling objective is used. This makes the [CLS] token redundant, so it is excluded from the tokenizer. As pretraining data, a random subset of the CulturaX dataset (introduced in the CulturaX paper) is used.
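To illustrate what training with only the masked language modeling objective looks like at the data level, here is a minimal sketch. It is not the actual BERTchen data pipeline: the token ids, the mask id, and the masking rate are hypothetical values chosen for illustration, and the scheme is simplified to always substitute the mask token.

```python
import random

IGNORE_INDEX = -100  # label value conventionally skipped by cross-entropy losses


def mask_tokens(token_ids, mask_id, mask_rate, rng):
    """Return (inputs, labels) for one sequence under a simplified MLM scheme."""
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_rate:
            inputs.append(mask_id)        # the model sees the mask token ...
            labels.append(tok)            # ... and must predict the original token
        else:
            inputs.append(tok)            # unmasked tokens pass through unchanged
            labels.append(IGNORE_INDEX)   # and contribute nothing to the loss
    return inputs, labels


rng = random.Random(0)
# Hypothetical token ids. Note that no [CLS] id is prepended to the sequence.
token_ids = [17, 42, 8, 99, 23, 5, 61, 30]
# mask_id and mask_rate are illustrative assumptions, not values from this card.
inputs, labels = mask_tokens(token_ids, mask_id=4, mask_rate=0.3, rng=rng)
```

Because the loss is computed only at masked positions, no sequence-level summary token is needed during pretraining, which is consistent with dropping [CLS].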

Training procedure

BERTchen was pretrained using the MosaicBERT hyperparameters (which can be found in the paper and here), except for the training goal, which we set to 2,500 to better estimate the number of steps the model will take. In addition, we use a batch size of 1024 with a sequence length of 512, as we found this to work better. All training configs can be found here.
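As a rough sanity check on these settings, the following sketch computes the token budget they imply. It assumes that the training goal of 2,500 counts optimizer steps (the unit is not stated here) and that every sequence is filled to 512 tokens, so the results are upper bounds rather than measured values.

```python
BATCH_SIZE = 1024       # sequences per step (from the text)
SEQ_LEN = 512           # maximum tokens per sequence (from the text)
TRAINING_GOAL = 2_500   # from the text; treated here as optimizer steps (assumption)

# Upper bounds: real batches contain fewer tokens if sequences are padded.
tokens_per_step = BATCH_SIZE * SEQ_LEN
max_tokens_total = tokens_per_step * TRAINING_GOAL

print(f"tokens per step (upper bound): {tokens_per_step:,}")
print(f"tokens over the run (upper bound): {max_tokens_total:,}")
```

Under these assumptions, one step covers at most 524,288 tokens and the whole run at most about 1.31 billion tokens.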

Evaluation results

After fine-tuning on GermanQuAD, GermEval 2017 B, and GerMS-Detect subtask 1 from GermEval 2024, we get the following results:

Task	GermanQuAD (F1/EM)	GermEval 2017 B	GerMS-Detect (majority vote)
Score	95.1/91.9	0.962	0.908
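For readers unfamiliar with the F1/EM pair reported for question answering, here is a minimal sketch of SQuAD-style exact match and token-level F1 for a single prediction. It uses simple lowercasing and whitespace tokenization, which is an assumption; the official GermanQuAD evaluation may normalize answers differently. The example strings are made up.

```python
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Simplified normalization: lowercase and split on whitespace.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(_tokens(prediction) == _tokens(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between the two answers."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "im Jahr 1990"   # hypothetical gold answer
partial = "1990"             # hypothetical prediction covering part of it

em_partial = exact_match(partial, reference)
f1_partial = token_f1(partial, reference)
```

A partially correct answer earns no exact-match credit but some F1 credit, and an exact match always has an F1 of 1, so an averaged F1 score is never lower than the averaged exact match.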

Efficiency

With MosaicBERT and FlashAttention 2, we increase the throughput from BERT's 190,000 tokens per second to about 250,000 tokens per second and achieve a model FLOPs utilization (MFU) of 65.87% (see the paper for more details and calculations).
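The figures above can be put in relation with a short calculation. This is a sketch and not the paper's calculation: it takes MFU as achieved FLOPs per second divided by the hardware peak, and assumes a peak of 312 TFLOPS for the A100 (NVIDIA's dense BF16/FP16 tensor-core figure), which may differ from the reference value used in the paper.

```python
BERT_TOKENS_PER_S = 190_000     # from the text
MOSAIC_TOKENS_PER_S = 250_000   # from the text ("about")
MFU = 0.6587                    # from the text

# Assumption: dense BF16/FP16 tensor-core peak of an NVIDIA A100.
A100_PEAK_FLOPS = 312e12

# Relative throughput gain of MosaicBERT with FlashAttention 2 over BERT.
speedup = MOSAIC_TOKENS_PER_S / BERT_TOKENS_PER_S

# MFU = achieved FLOPs/s divided by peak FLOPs/s, so achieved = MFU * peak.
achieved_flops = MFU * A100_PEAK_FLOPS

print(f"throughput speedup: {speedup:.2f}x")
print(f"implied achieved compute: {achieved_flops / 1e12:.1f} TFLOPS")
```

Under the stated peak, the reported MFU corresponds to roughly 205 TFLOPS of useful compute, and the throughput gain is about 1.32x.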

Model variations

For the creation of BERTchen we tested different datasets and training setups. Two notable variants are:

BERTchen-v0.1-C4: Same pretraining setup and hyperparameters, but trained on the C4 dataset.
hybrid_BERTchen-v0.1: Pretrained on CulturaX with our own hybrid approach that changes the sequence length during pretraining (for more information, see its model card or the paper).

frederic-sadrieh/BERTchen-v0.1