BübleLM SFT WIP

BübleLM

A small German LM

BübleLM is a German language model based on Gemma-2-2B, adapted using trans-tokenization with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

This is an experimental version that has received supervised fine-tuning on several German datasets. A DPO version will follow soon.

Model Details

  • Architecture: Based on the Gemma-2-2B decoder-only architecture
  • Parameters: 2 billion
  • Tokenizer: Custom German SentencePiece tokenizer (20k vocabulary)
    • Fertility rate: 1.78 tokens per word (see the sketch after this list)
    • Optimized for German morphological structures
    • Trained on the same corpus as the model
  • Context Length: 8192 tokens
  • Training Hardware: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
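
The fertility figure can be reproduced with a short script like the one below; the model id is taken from this repository, and the sample sentence is only an illustration.

```python
from transformers import AutoTokenizer

# Load the custom German SentencePiece tokenizer shipped with this model.
tokenizer = AutoTokenizer.from_pretrained("johannhartmann/bueble-lm-2b-sft")

# Any German text works here; this sentence is just an example.
text = "Die Bundesregierung hat heute ein neues Gesetz zur Digitalisierung beschlossen."

words = text.split()
tokens = tokenizer.tokenize(text)

# Fertility = subword tokens per whitespace-separated word.
print(f"{len(tokens)} tokens / {len(words)} words = {len(tokens) / len(words):.2f}")
```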

Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:

  • Contemporary web content (OSCAR 2015-2023)
  • Legislative documents (EurLex, ParlamInt)
  • News data (Tagesschau)
  • Wiki sources

Data sampling weights:

  • Wikipedia: 4x
  • News/Parliamentary: 2x
  • Other sources: 1x
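
As a rough illustration of how such weights translate into the training mixture, the sketch below upsamples each source by its weight before shuffling; this is not the project's actual preprocessing code, and the document names are placeholders.

```python
import random

# Placeholder corpus shards grouped by source; weights mirror the list above.
sources = {
    "wikipedia": {"weight": 4, "docs": ["wiki_doc_1", "wiki_doc_2"]},
    "news_parliamentary": {"weight": 2, "docs": ["tagesschau_doc_1", "parlamint_doc_1"]},
    "web_legal_other": {"weight": 1, "docs": ["oscar_doc_1", "eurlex_doc_1"]},
}

# Repeat each source's documents according to its weight, then shuffle.
mixed = []
for source in sources.values():
    mixed.extend(source["docs"] * source["weight"])
random.shuffle(mixed)
```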

Finetuning

Additional supervised fine-tuning via LoRA was performed using German translations of alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes, and synthia.
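
A minimal sketch of how such a LoRA-based SFT run could be set up with peft and trl is shown below; the base checkpoint id, dataset file, and all hyperparameters are assumptions, not the configuration actually used here.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder ids/paths: replace with the actual base checkpoint and instruction data.
base_model = "bueble-lm-2b"
# The dataset is expected to contain a "text" column with formatted prompt/response pairs.
dataset = load_dataset("json", data_files="german_instructions.json", split="train")

# Hypothetical LoRA settings; rank, alpha, and target modules are illustrative only.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="bueble-lm-2b-sft"),
)
trainer.train()
```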

Performance

To be determined after DPO training.

Usage
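
A minimal text-generation example with transformers, assuming the standard causal-LM interface; the prompt and sampling settings are only illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "johannhartmann/bueble-lm-2b-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# German prompt; the model is fine-tuned on German instruction data.
prompt = "Erkläre kurz, was ein Sprachmodell ist."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```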

Source

@article{delobelle2024buble,
    title={BübleLM: A small German LM},
    author={Delobelle, Pieter and Akbik, Alan and others},
    year={2024}
}