tags:
- merge
- mergekit
- lazymergekit
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
base_model:
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
language:
- de
- en
Spaetzle-v8-7b
This model is intended to perform adequately in German and English across a range of tasks while mostly behaving well, that is, without rambling or mixing in tokens from the different chat templates seen during training and adaptation.
It is mostly a quick test and is considerably weaker in German grammar and orthography than, for example, DiscoLM. For use cases where that matters less than instruction following, reasoning, and similar abilities, it might actually be slightly preferable.
It is a merge of the following models using LazyMergekit:
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
The merge uses mayflowergmbh/Wiedervereinigung-7b-dpo-laser as the base model.
All credits are due to the creators of those original models and the training datasets involved.
For a suitable quantized version, try cstr/Spaetzle-v8-7b-GGUF
Evaluation
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 72.27 |
AI2 Reasoning Challenge (25-Shot) | 68.69 |
HellaSwag (10-Shot) | 86.68 |
MMLU (5-Shot) | 64.60 |
TruthfulQA (0-shot) | 64.05 |
Winogrande (5-shot) | 81.45 |
GSM8k (5-shot) | 68.16 |
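As a quick consistency check, the reported average can be recomputed from the six scores in the table above. This is a minimal sketch that assumes the average is a plain unweighted mean; the name `scores` is only illustrative.

```python
# Scores copied from the Open LLM Leaderboard table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 68.69,
    "HellaSwag (10-Shot)": 86.68,
    "MMLU (5-Shot)": 64.60,
    "TruthfulQA (0-shot)": 64.05,
    "Winogrande (5-shot)": 81.45,
    "GSM8k (5-shot)": 68.16,
}

# Assumption: the leaderboard average is an unweighted mean of the six tasks.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")
```

Under that assumption the result agrees with the `Avg.` row.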
EQ-Bench (v2_de): 61.04 / English (v2): 78.3
Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
---|---|---|---|---|---|
Spaetzle-v8-7b | 45.31 | 75.69 | 63.94 | 45.57 | 57.63 |
AGIEval
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.59 | ± | 2.74 |
| | | acc_norm | 24.80 | ± | 2.72 |
| agieval_logiqa_en | 0 | acc | 39.63 | ± | 1.92 |
| | | acc_norm | 39.78 | ± | 1.92 |
| agieval_lsat_ar | 0 | acc | 23.48 | ± | 2.80 |
| | | acc_norm | 24.35 | ± | 2.84 |
| agieval_lsat_lr | 0 | acc | 50.98 | ± | 2.22 |
| | | acc_norm | 51.96 | ± | 2.21 |
| agieval_lsat_rc | 0 | acc | 62.08 | ± | 2.96 |
| | | acc_norm | 62.83 | ± | 2.95 |
| agieval_sat_en | 0 | acc | 78.64 | ± | 2.86 |
| | | acc_norm | 79.13 | ± | 2.84 |
| agieval_sat_en_without_passage | 0 | acc | 44.66 | ± | 3.47 |
| | | acc_norm | 44.66 | ± | 3.47 |
| agieval_sat_math | 0 | acc | 37.27 | ± | 3.27 |
| | | acc_norm | 35.00 | ± | 3.22 |
Average: 45.31%
GPT4All
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 63.14 | ± | 1.41 |
| | | acc_norm | 64.51 | ± | 1.40 |
| arc_easy | 0 | acc | 85.98 | ± | 0.71 |
| | | acc_norm | 82.49 | ± | 0.78 |
| boolq | 1 | acc | 88.10 | ± | 0.57 |
| hellaswag | 0 | acc | 66.31 | ± | 0.47 |
| | | acc_norm | 85.17 | ± | 0.35 |
| openbookqa | 0 | acc | 38.00 | ± | 2.17 |
| | | acc_norm | 47.20 | ± | 2.23 |
| piqa | 0 | acc | 83.35 | ± | 0.87 |
| | | acc_norm | 84.17 | ± | 0.85 |
| winogrande | 0 | acc | 78.22 | ± | 1.16 |
Average: 75.69%
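The source does not say which metric feeds the suite average. The sketch below tests one plausible rule, inferred only from the numbers: use `acc_norm` where it is reported and `acc` otherwise. The names `results` and `preferred` are illustrative.

```python
# (acc, acc_norm) pairs copied from the GPT4All table; None means not reported.
results = {
    "arc_challenge": (63.14, 64.51),
    "arc_easy": (85.98, 82.49),
    "boolq": (88.10, None),
    "hellaswag": (66.31, 85.17),
    "openbookqa": (38.00, 47.20),
    "piqa": (83.35, 84.17),
    "winogrande": (78.22, None),
}


def preferred(acc, acc_norm):
    # Assumed selection rule: take acc_norm when it exists, otherwise acc.
    return acc if acc_norm is None else acc_norm


suite_average = sum(preferred(*pair) for pair in results.values()) / len(results)
print(f"{suite_average:.2f}")
```

With this rule the mean matches the reported 75.69%, which suggests, but does not prove, that the evaluation harness aggregates this way.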
TruthfulQA
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 47.74 | ± | 1.75 |
| | | mc2 | 63.94 | ± | 1.53 |
Average: 63.94%
Bigbench
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 56.84 | ± | 3.60 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 66.12 | ± | 2.47 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 41.47 | ± | 3.07 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 22.01 | ± | 2.19 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 31.40 | ± | 2.08 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.14 | ± | 1.60 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 56.00 | ± | 2.87 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 45.00 | ± | 2.23 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.70 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 70.05 | ± | 1.02 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 45.54 | ± | 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 26.05 | ± | 1.39 |
| bigbench_snarks | 0 | multiple_choice_grade | 71.82 | ± | 3.35 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 72.92 | ± | 1.42 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 44.20 | ± | 1.57 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.80 | ± | 1.19 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 18.23 | ± | 0.92 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 56.00 | ± | 2.87 |
Average: 45.57%
Average score: 57.63%
💻 Usage
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "cstr/Spaetzle-v8-7b"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Format the conversation with the chat template shipped with the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision; device_map="auto" relies on accelerate
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample up to 256 new tokens and print the result
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
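By default, the text-generation pipeline returns the prompt followed by the completion in `generated_text`. If only the reply is wanted, one option is to strip the prompt prefix afterwards. The helper below is a hypothetical sketch (`extract_reply` is not part of transformers) and runs on illustrative strings, so no model is loaded.

```python
def extract_reply(generated_text: str, prompt: str) -> str:
    """Return only the newly generated part of a text-generation output."""
    # The pipeline echoes the prompt at the start of generated_text by default.
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt is not a literal prefix.
    return generated_text.strip()


# Illustrative strings only; a real call would pass outputs[0]["generated_text"].
demo_prompt = "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
demo_output = demo_prompt + "Hello! How can I help?"
print(extract_reply(demo_output, demo_prompt))
```

Passing `return_full_text=False` to the pipeline call should give a similar result without post-processing.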
🧩 Configuration
The model uses the ChatML prompt format and should work well with it, since it is merged from models that (mostly) saw ChatML templates in training.
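For readers unfamiliar with ChatML, the sketch below renders a message list in the usual ChatML layout. The function name `to_chatml` is hypothetical, and the template shipped with the tokenizer may differ in details such as a system prompt or special tokens, so `tokenizer.apply_chat_template` remains the authoritative way to build prompts.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # An open assistant turn tells the model to start its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is a large language model?"}]
print(to_chatml(messages))
```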
models:
  - model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
    # no parameters necessary for base model
  - model: flemmingmiguel/NeuDist-Ro-7B
    parameters:
      density: 0.60
      weight: 0.30
  - model: johannhartmann/Brezn3
    parameters:
      density: 0.65
      weight: 0.40
  - model: ResplendentAI/Flora_DPO_7B
    parameters:
      density: 0.6
      weight: 0.3
merge_method: dare_ties
base_model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
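To give a rough idea of what `density` and `weight` control, here is a toy sketch of the drop-and-rescale step behind DARE. Each model's difference from the base is randomly sparsified to roughly `density` of its entries, the survivors are rescaled by `1 / density`, and the result is added to the base scaled by `weight`. This is not mergekit's implementation: it omits the TIES sign election, `int8_mask`, and any normalization, and all function names and the four-element "weights" are made up for illustration.

```python
import random


def dare_delta(delta, density, rng):
    """Keep each delta entry with probability `density`; rescale kept entries."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def merge_sketch(base, finetuned_models, seed=0):
    """Add weighted, sparsified deltas to the base (no TIES sign election)."""
    rng = random.Random(seed)
    merged = list(base)
    for weights, density, weight in finetuned_models:
        # Task vector: how the fine-tuned model differs from the base.
        delta = [w - b for w, b in zip(weights, base)]
        for i, d in enumerate(dare_delta(delta, density, rng)):
            merged[i] += weight * d
    return merged


# Toy 4-parameter "models"; density and weight values mirror the config above.
base = [0.0, 1.0, -1.0, 0.5]
models = [
    ([0.2, 1.1, -1.0, 0.4], 0.60, 0.30),
    ([0.1, 0.9, -0.8, 0.5], 0.65, 0.40),
    ([0.0, 1.2, -1.1, 0.7], 0.60, 0.30),
]
print(merge_sketch(base, models, seed=0))
```

Rescaling by `1 / density` keeps the expected value of each delta entry unchanged despite the random drops.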