BübleLM SFT WIP

BübleLM

A small German LM

BübleLM is a German language model based on Gemma-2-2B, adapted using trans-tokenization with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

This is an experimental version that has received supervised fine-tuning on several German datasets. A DPO version will follow soon.

Model Details

  • Architecture: Based on the Gemma-2-2B decoder-only architecture
  • Parameters: 2 billion
  • Tokenizer: Custom German SentencePiece tokenizer (20k vocabulary)
    • Fertility rate: 1.78 tokens per word (see the sketch after this list)
    • Optimized for German morphological structures
    • Trained on the same corpus as the model
  • Context Length: 8192 tokens
  • Training Hardware: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
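
The fertility figure can be reproduced with a short script like the one below; the model id is taken from this repository, and the sample sentence is only an illustration.

```python
from transformers import AutoTokenizer

# Load the custom German SentencePiece tokenizer shipped with this model.
tokenizer = AutoTokenizer.from_pretrained("johannhartmann/bueble-lm-2b-sft")

# Any German text works here; this sentence is just an example.
text = "Die Bundesregierung hat heute ein neues Gesetz zur Digitalisierung beschlossen."

words = text.split()
tokens = tokenizer.tokenize(text)

# Fertility = subword tokens per whitespace-separated word.
print(f"{len(tokens)} tokens / {len(words)} words = {len(tokens) / len(words):.2f}")
```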

Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:

  • Contemporary web content (OSCAR 2015-2023)
  • Legislative documents (EurLex, ParlamInt)
  • News data (Tagesschau)
  • Wiki sources

Data sampling weights:

  • Wikipedia: 4x
  • News/Parliamentary: 2x
  • Other sources: 1x
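
As a rough illustration of how such weights translate into the training mixture, the sketch below upsamples each source by its weight before shuffling; this is not the project's actual preprocessing code, and the document names are placeholders.

```python
import random

# Placeholder corpus shards grouped by source; weights mirror the list above.
sources = {
    "wikipedia": {"weight": 4, "docs": ["wiki_doc_1", "wiki_doc_2"]},
    "news_parliamentary": {"weight": 2, "docs": ["tagesschau_doc_1", "parlamint_doc_1"]},
    "web_legal_other": {"weight": 1, "docs": ["oscar_doc_1", "eurlex_doc_1"]},
}

# Repeat each source's documents according to its weight, then shuffle.
mixed = []
for source in sources.values():
    mixed.extend(source["docs"] * source["weight"])
random.shuffle(mixed)
```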

Finetuning

Additional supervised fine-tuning via LoRA was performed using German translations of alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes, and synthia.
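
A minimal sketch of how such a LoRA-based SFT run could be set up with peft and trl is shown below; the base checkpoint id, dataset file, and all hyperparameters are assumptions, not the configuration actually used here.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder ids/paths: replace with the actual base checkpoint and instruction data.
base_model = "bueble-lm-2b"
# The dataset is expected to contain a "text" column with formatted prompt/response pairs.
dataset = load_dataset("json", data_files="german_instructions.json", split="train")

# Hypothetical LoRA settings; rank, alpha, and target modules are illustrative only.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="bueble-lm-2b-sft"),
)
trainer.train()
```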

Performance

To be determined after DPO training.

Usage
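
A minimal text-generation example with transformers, assuming the standard causal-LM interface; the prompt and sampling settings are only illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "johannhartmann/bueble-lm-2b-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# German prompt; the model is fine-tuned on German instruction data.
prompt = "Erkläre kurz, was ein Sprachmodell ist."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```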

Source

@article{delobelle2024buble,
    title={BübleLM: A small German LM},
    author={Delobelle, Pieter and Akbik, Alan and others},
    year={2024}
}