Upload folder using huggingface_hub

Files changed (8)

README.md +192 -0
config.json +33 -0
model.safetensors +3 -0
special_tokens_map.json +24 -0
tokenizer.model +3 -0
tokenizer_config.json +322 -0
trainer_state.json +0 -0
training_args.bin +3 -0

README.md ADDED

	@@ -0,0 +1,192 @@

+---
+language:
+- de
+tags:
+- german
+- causal-lm
+- text-generation
+library_name: transformers
+pipeline_tag: text-generation
+license: apache-2.0
+---
+# BübleLM
+<div align="center" style="margin-bottom: 2rem; margin-top: 2rem">
+    <img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" style="max-height: 450px; width: auto;"/>
+    <h1 style="margin-top: 1rem;">BübleLM</h1>
+    <p><em>A small German LM</em></p>
+</div>
+BübleLM is a German language model based on Gemma-2-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.
+## Model Details
+- **Architecture**: Based on the Gemma-2-2B decoder-only architecture
+- **Parameters**: 2 billion
+- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
+  - Fertility rate: 1.78 tokens per word
+  - Optimized for German morphological structures
+  - Trained on the same corpus as the model
+- **Context Length**: 8192 tokens
+- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
+## Training Data
+Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
+- Contemporary web content (OSCAR 2015-2023)
+- Legislative documents (EurLex, ParlaMint)
+- News data (Tagesschau)
+- Wiki sources
+Data sampling weights:
+- Wikipedia: 4x
+- News/Parliamentary: 2x
+- Other sources: 1x
+## Performance
+Key improvements over Gemma-2-2B baseline:
+- HellaSwag-DE: +71% (47.9% vs 28.0%)
+- ARC-DE: +41% (32.3% vs 22.9%)
+- Average zero-shot: +40% (35.8% vs 25.5%)
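+→ BübleLM-2B outperforms the base Gemma-2-2B on every task and beats other German models such as LLäMmlein-1B on most tasks. The exceptions are HellaSwag-DE 0-shot, where LLäMmlein-1B leads (48.5 vs 47.9), and TruthfulQA-DE, where Sauerkraut-Gemma-2B leads (32.9 vs 27.2).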
+<table class="model-comparison">
+  <thead>
+    <tr>
+      <th align="left">Model</th>
+      <th align="center" colspan="2">ARC-DE</th>
+      <th align="center" colspan="2">HellaSwag-DE</th>
+      <th align="center">TruthfulQA-DE</th>
+      <th align="center">Average</th>
+    </tr>
+    <tr>
+      <th></th>
+      <th align="center">0-shot</th>
+      <th align="center">3-shot</th>
+      <th align="center">0-shot</th>
+      <th align="center">3-shot</th>
+      <th align="center">0-shot</th>
+      <th align="center">0-shot</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><a href="https://huggingface.co/google/gemma-2-2b" target="_blank">Gemma-2-2B</a></td>
+      <td align="center">22.9</td>
+      <td align="center">23.1</td>
+      <td align="center">28.0</td>
+      <td align="center">27.6</td>
+      <td align="center">25.5</td>
+      <td align="center">25.5</td>
+    </tr>
+    <tr>
+      <td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_120M" target="_blank">LLäMmlein-120M</a></td>
+      <td align="center">24.7 ↑+8%</td>
+      <td align="center">-</td>
+      <td align="center">32.0 ↑+14%</td>
+      <td align="center">-</td>
+      <td align="center">25.0 ↓-2%</td>
+      <td align="center">27.2 ↑+7%</td>
+    </tr>
+    <tr>
+      <td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_1B" target="_blank">LLäMmlein-1B</a></td>
+      <td align="center">30.0 ↑+31%</td>
+      <td align="center">-</td>
+      <td align="center"><strong>48.5</strong> ↑+73%</td>
+      <td align="center">-</td>
+      <td align="center">23.4 ↓-8%</td>
+      <td align="center">34.0 ↑+33%</td>
+    </tr>
+    <tr>
+      <td><a href="https://huggingface.co/VAGOsolutions/SauerkrautLM-Gemma-2b" target="_blank">Sauerkraut-Gemma-2B</a></td>
+      <td align="center">28.0 ↑+22%</td>
+      <td align="center">34.6 ↑+50%</td>
+      <td align="center">37.2 ↑+33%</td>
+      <td align="center">44.1 ↑+60%</td>
+      <td align="center"><strong>32.9</strong> ↑+29%</td>
+      <td align="center">32.7 ↑+28%</td>
+    </tr>
+    <tr>
+      <td><strong>BübleLM (Ours)</strong></td>
+      <td align="center"><strong>32.3</strong> ↑+41%</td>
+      <td align="center"><strong>35.2</strong> ↑+52%</td>
+      <td align="center">47.9 ↑+71%</td>
+      <td align="center"><strong>46.6</strong> ↑+69%</td>
+      <td align="center">27.2 ↑+7%</td>
+      <td align="center"><strong>35.8</strong> ↑+40%</td>
+    </tr>
+  </tbody>
+</table>
+*Performance evaluated on German versions of ARC (knowledge-based QA), HellaSwag (commonsense reasoning), and TruthfulQA (truthfulness). Values show accuracy in percent, with arrows indicating relative improvement over the Gemma-2-2B baseline. Best results are shown in bold.*
+## Safety & Ethics
+### Toxicity
+- Perplexity: 52.97 on German TextDetox dataset
+- Toxic content appears more out-of-distribution compared to baseline
+### Gender Bias
+- Evaluated using perplexity differences between traditional and gender-inclusive forms
+- Slight preference for gender-inclusive language (not statistically significant)
+- Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)
+## Usage
+**Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.
+Also make sure you have the `sentencepiece` package installed:
+```bash
+pip install sentencepiece
+```
+```python
+from transformers import pipeline
+pipe = pipeline("text-generation", model="flair/bueble-lm-2b")
+pipe("Ich bin")
+```
+Or with the full model API:
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
+model = AutoModelForCausalLM.from_pretrained(
+    "flair/bueble-lm-2b",
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+)
+# Basic text completion
+text = "Berlin ist eine Stadt, die"
+# Move the inputs to whichever device device_map="auto" placed the model on
+inputs = tokenizer(text, return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=256)
+print(tokenizer.decode(outputs[0]))
+```
+For instruction-tuning experiments or chat applications, we recommend fine-tuning the model first with appropriate German instruction datasets.
+## Limitations
+- Limited vocabulary size (20k tokens) compared to multilingual models (256k for Gemma)
+- Performance may vary on specialized domains not well-represented in training data
+- Higher fertility rate (1.78) due to smaller vocabulary size
+- Inherits base limitations from Gemma architecture
+## Citation
+```bibtex
+@article{delobelle2024buble,
+    title={BübleLM: A small German LM},
+    author={Delobelle, Pieter and Akbik, Alan and others},
+    year={2024}
+}
+```
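The relative gains quoted in the README's Performance section can be re-derived from the absolute scores in its table. The following is a minimal sketch, assuming the percentages are rounded to the nearest whole number and that the "Average" column is the unweighted mean of the three zero-shot columns. The helper names `relative_gain` and `average` are illustrative and do not come from the repository.

```python
# Zero-shot accuracies (%) copied from the README comparison table.
SCORES = {
    "Gemma-2-2B": {"ARC-DE": 22.9, "HellaSwag-DE": 28.0, "TruthfulQA-DE": 25.5},
    "BübleLM": {"ARC-DE": 32.3, "HellaSwag-DE": 47.9, "TruthfulQA-DE": 27.2},
}


def relative_gain(score: float, baseline: float) -> int:
    """Relative improvement over the baseline, in whole percent."""
    return round((score / baseline - 1.0) * 100)


def average(scores: dict) -> float:
    """Unweighted mean over the zero-shot tasks, rounded to one decimal."""
    return round(sum(scores.values()) / len(scores), 1)


baseline = SCORES["Gemma-2-2B"]
model = SCORES["BübleLM"]

# Per-task relative gain of BübleLM over the Gemma-2-2B baseline.
gains = {task: relative_gain(model[task], baseline[task]) for task in model}
avg_gain = relative_gain(average(model), average(baseline))

print(gains)
print(average(model), average(baseline), avg_gain)
```

Under these assumptions the computed values agree with the figures listed in the README (for example +71% on HellaSwag-DE and a 35.8 vs 25.5 average).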

config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "_name_or_path": "pdelobelle/gemma-2-2b-de",
+  "architectures": [
+    "Gemma2ForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "attn_logit_softcapping": 50.0,
+  "bos_token_id": 2,
+  "cache_implementation": "hybrid",
+  "eos_token_id": 2,
+  "final_logit_softcapping": 30.0,
+  "head_dim": 256,
+  "hidden_act": "gelu_pytorch_tanh",
+  "hidden_activation": "gelu_pytorch_tanh",
+  "hidden_size": 2304,
+  "initializer_range": 0.02,
+  "intermediate_size": 9216,
+  "max_position_embeddings": 8192,
+  "model_type": "gemma2",
+  "num_attention_heads": 8,
+  "num_hidden_layers": 26,
+  "num_key_value_heads": 4,
+  "pad_token_id": 3,
+  "query_pre_attn_scalar": 256,
+  "rms_norm_eps": 1e-06,
+  "rope_theta": 10000.0,
+  "sliding_window": 4096,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.44.1",
+  "use_cache": true,
+  "vocab_size": 20000
+}
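The `attn_logit_softcapping` and `final_logit_softcapping` entries in this config bound the attention scores and the output logits. Below is a minimal sketch, assuming the commonly described Gemma 2 formulation `cap * tanh(x / cap)`. The helper name `softcap` is illustrative, and a real implementation would operate on tensors rather than Python floats.

```python
import math

ATTN_LOGIT_SOFTCAPPING = 50.0   # value from config.json
FINAL_LOGIT_SOFTCAPPING = 30.0  # value from config.json


def softcap(x: float, cap: float) -> float:
    """Squash x smoothly so that its magnitude never exceeds cap."""
    return cap * math.tanh(x / cap)


# Small inputs pass through almost unchanged; large inputs saturate near the cap.
for raw in (1.0, 25.0, 100.0):
    print(raw, softcap(raw, ATTN_LOGIT_SOFTCAPPING), softcap(raw, FINAL_LOGIT_SOFTCAPPING))
```

The design point is that the function is close to the identity near zero, so typical logits are barely affected, while extreme values are bounded without a hard clip.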

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:986191e4e7d52df000fc1b3b15c3d44e59bb2123e4a33ec2831b91532bb61fc1
+size 4141229384
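The checkpoint size in this pointer can be cross-checked against the dimensions in config.json. The sketch below is an estimate under stated assumptions that the config does not spell out: the output head shares weights with the input embedding, each layer has a gated MLP with three projection matrices, each layer carries four RMSNorm weight vectors, and there are no bias terms. The function name `estimate_params` is illustrative.

```python
# Values copied from config.json in this commit.
cfg = {
    "vocab_size": 20000,
    "hidden_size": 2304,
    "intermediate_size": 9216,
    "num_hidden_layers": 26,
    "num_attention_heads": 8,
    "num_key_value_heads": 4,
    "head_dim": 256,
}


def estimate_params(c: dict) -> int:
    h = c["hidden_size"]
    q_dim = c["num_attention_heads"] * c["head_dim"]
    kv_dim = c["num_key_value_heads"] * c["head_dim"]
    # Query, key, value and output projections (assumed bias-free).
    attention = h * q_dim + 2 * h * kv_dim + q_dim * h
    # Gate, up and down projections (assumed gated MLP).
    mlp = 3 * h * c["intermediate_size"]
    # Assumed four RMSNorm weight vectors per layer.
    norms = 4 * h
    per_layer = attention + mlp + norms
    # Embedding matrix, assumed to be tied with the output head.
    embeddings = c["vocab_size"] * h
    # The trailing "+ h" accounts for the final norm.
    return embeddings + c["num_hidden_layers"] * per_layer + h


n_params = estimate_params(cfg)
bf16_bytes = 2 * n_params          # bfloat16 stores 2 bytes per parameter
lfs_size = 4141229384              # size from the LFS pointer above

print(n_params, bf16_bytes, lfs_size - bf16_bytes)
```

Under these assumptions the estimate comes to roughly 2.07 billion parameters, and the bfloat16 payload falls within a few tens of kilobytes of the reported file size, which is plausible for a safetensors header.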

special_tokens_map.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "</s>",
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
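Entries in this file come in two shapes: a full object with a `content` field, or a bare string (as for `pad_token`). A minimal sketch of reading both shapes uniformly follows; the helper name `token_content` is illustrative and not part of any library.

```python
import json

# Contents of special_tokens_map.json from this commit.
SPECIAL_TOKENS_MAP = json.loads("""
{
  "bos_token": {"content": "<s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "eos_token": {"content": "</s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "pad_token": "</s>",
  "unk_token": {"content": "<unk>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false}
}
""")


def token_content(entry) -> str:
    """Return the token string whether the entry is an object or a bare string."""
    return entry["content"] if isinstance(entry, dict) else entry


tokens = {name: token_content(entry) for name, entry in SPECIAL_TOKENS_MAP.items()}
print(tokens)
print("pad reuses eos:", tokens["pad_token"] == tokens["eos_token"])
```

Here the padding token is the same string as the end-of-sequence token, while config.json in this commit sets `pad_token_id` to 3, which tokenizer_config.json maps to `<pad>`. It may be worth confirming which padding token is actually in effect before batching inputs.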

tokenizer.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:16e8773affbd03448ffb79173feed1884514012160d3074641ee402dcad4f481
+size 579378

tokenizer_config.json ADDED

	@@ -0,0 +1,322 @@

+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "add_prefix_space": null,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "5": {
+      "content": "▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "6": {
+      "content": "▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "7": {
+      "content": "▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "8": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "9": {
+      "content": "--",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "10": {
+      "content": "----",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "11": {
+      "content": "-----",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "12": {
+      "content": "--------",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "13": {
+      "content": "----------------",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "14": {
+      "content": "++",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "15": {
+      "content": "/**",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "16": {
+      "content": "***",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "17": {
+      "content": "****",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "18": {
+      "content": "******",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "19": {
+      "content": "********",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "20": {
+      "content": "**/",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "21": {
+      "content": "##",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "22": {
+      "content": "###",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "23": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "24": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "25": {
+      "content": "<|system|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "26": {
+      "content": "<|user|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "27": {
+      "content": "<|assistant|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "28": {
+      "content": "▁—",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "29": {
+      "content": "▁“",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "30": {
+      "content": "“",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "31": {
+      "content": "”",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32": {
+      "content": "’",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "33": {
+      "content": "—",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "34": {
+      "content": "{",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "35": {
+      "content": "}\"",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "36": {
+      "content": "{\"",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "37": {
+      "content": "}",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "legacy": true,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "</s>",
+  "sp_model_kwargs": {},
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false
+}
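The `model_max_length` value in this file is not a real limit: it equals `int(1e30)`, which, to the best of my knowledge, is the placeholder `transformers` writes when no maximum length was set for the tokenizer. A minimal sketch of deriving a usable limit, assuming the `max_position_embeddings` value from config.json is the one that matters; the function name `effective_max_length` is illustrative.

```python
TOKENIZER_MODEL_MAX_LENGTH = 1000000000000000019884624838656  # from tokenizer_config.json
MAX_POSITION_EMBEDDINGS = 8192                                # from config.json

# The float 1e30 is not exactly representable, hence the odd trailing digits.
UNSET_SENTINEL = int(1e30)


def effective_max_length(tokenizer_max: int, model_max_positions: int) -> int:
    """Use the model's positional limit when the tokenizer limit is larger or unset."""
    return min(tokenizer_max, model_max_positions)


print(TOKENIZER_MODEL_MAX_LENGTH == UNSET_SENTINEL)
print(effective_max_length(TOKENIZER_MODEL_MAX_LENGTH, MAX_POSITION_EMBEDDINGS))
```

In practice this means truncation has to be requested with an explicit length (for example the 8192-token context stated in the README), since the tokenizer's own limit will never trigger.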

trainer_state.json ADDED

The diff for this file is too large to render. See raw diff

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:41a61c806a0bb2deecb47b569453df82a098015e8a830d9b725661463eeec7db
+size 5304
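The three binary files in this commit are stored as Git LFS pointers: short text files with `version`, `oid` and `size` lines. A minimal sketch of parsing such a pointer and checking downloaded bytes against it, using only the standard library; the function names `parse_lfs_pointer` and `matches_pointer` are illustrative.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the space-separated key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that the bytes have the size and SHA-256 digest named by the pointer."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


# Pointer for training_args.bin, copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:41a61c806a0bb2deecb47b569453df82a098015e8a830d9b725661463eeec7db
size 5304
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

After downloading the real file, `matches_pointer(open(path, "rb").read(), pointer)` would confirm that the download is intact; for the multi-gigabyte `model.safetensors`, hashing in chunks would be the more memory-friendly variant.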