---
language:
  - en
license: apache-2.0
model-index:
  - name: WestSeverus-7B-DPO-v2
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 71.42
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 88.27
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 64.79
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 72.37
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 83.27
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 71.65
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
---

WestSeverus-7B-DPO-v2


☘️ Model Description

WestSeverus-7B-DPO-v2 is a WestLake-family model trained on top of WestSeverus-7B.

The model was trained on several DPO datasets and performs well on basic math problems.

WestSeverus-7B-DPO-v2 can be used for mathematics, chemistry, physics, and even coding for further research and reference.
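
As a quick start, the model can be loaded with the Hugging Face transformers library. This is a minimal sketch: the repo id FelixChao/WestSeverus-7B-DPO-v2 is taken from the leaderboard links in this card, and the dtype and device settings are illustrative, not requirements.

```python
# Minimal sketch: loading WestSeverus-7B-DPO-v2 and generating a completion.
# The dtype/device settings are illustrative; adjust to your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FelixChao/WestSeverus-7B-DPO-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("What is 15% of 240?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For chat-style use, apply the ChatML template described in the Prompt Format section below before generating.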

📖 Table of Contents

  1. Nous Benchmark Results
    • AGIEval
    • GPT4All
    • TruthfulQA Scores
    • BigBench
  2. Open LLM Leaderboard
    • ARC
    • HellaSwag
    • MMLU
    • TruthfulQA
    • Winogrande
    • GSM8K
  3. EvalPlus Leaderboard
    • HumanEval
    • HumanEval_Plus
    • MBPP
    • MBPP_Plus
  4. Prompt Format
  5. Quantized Models
  6. Gratitude

🪄 Nous Benchmark Results

WestSeverus-7B-DPO-v2 currently sits at the top of the YALL - Yet Another LLM Leaderboard created by CultriX, and posts the strongest TruthfulQA and BigBench scores among the models listed below.

| Model | Average | AGIEval | GPT4All | TruthfulQA | BigBench |
|---|---|---|---|---|---|
| WestSeverus-7B-DPO-v2 | 60.98 | 45.29 | 77.2 | 72.72 | 48.71 |
| CultriX/Wernicke-7B-v1 | 60.73 | 45.59 | 77.36 | 71.46 | 48.49 |
| mlabonne/NeuralBeagle14-7B | 60.25 | 46.06 | 76.77 | 70.32 | 47.86 |
| CultriX/MistralTrix-v1 | 60.05 | 44.98 | 76.62 | 71.44 | 47.17 |
| senseable/WestLake-7B-v2 | 59.42 | 44.27 | 77.86 | 67.46 | 48.09 |
| mlabonne/Daredevil-7B | 58.22 | 44.85 | 76.07 | 64.89 | 47.07 |
| microsoft/phi-2 | 44.61 | 27.96 | 70.84 | 44.46 | 35.17 |

πŸ† Open LLM Leaderboard

WestSeverus-7B-DPO-v2 is one of the top 7B model in Open LLM Leaderboard and it outperforms on TruthfulQA and GSM8K.

| Metric | Value |
|---|---|
| Avg. | 75.29 |
| AI2 Reasoning Challenge (25-Shot) | 71.42 |
| HellaSwag (10-Shot) | 88.27 |
| MMLU (5-Shot) | 64.79 |
| TruthfulQA (0-shot) | 72.37 |
| Winogrande (5-shot) | 83.27 |
| GSM8k (5-shot) | 71.65 |

Detailed results can be found here
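
These scores come from the leaderboard's lm-evaluation-harness runs. A rough local reproduction of a single task might look like the sketch below; the simple_evaluate call and task name reflect recent harness versions (an assumption), and local numbers may not match the leaderboard exactly across harness versions and configurations.

```python
# Sketch: re-running one leaderboard task (GSM8K, 5-shot) locally with
# EleutherAI's lm-evaluation-harness. Exact scores may differ from the
# leaderboard depending on harness version and configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=FelixChao/WestSeverus-7B-DPO-v2,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,   # matches the leaderboard's GSM8k (5-shot) setting
    batch_size=8,
)
print(results["results"]["gsm8k"])
```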

⚡ EvalPlus Leaderboard

| Model | HumanEval | HumanEval_Plus | MBPP | MBPP_Plus |
|---|---|---|---|---|
| phi-2-2.7B | 48.2 | 43.3 | 61.9 | 51.4 |
| WestSeverus-7B-DPO-v2 | 43.3 | 34.1 | TBD | TBD |
| SOLAR-10.7B-Instruct-v1.0 | 42.1 | 34.3 | 42.9 | 34.6 |
| CodeLlama-7B | 37.8 | 34.1 | 57.6 | 45.4 |


βš—οΈ Prompt Format

WestSeverus-7B-DPO-v2 was trained using the ChatML prompt templates with system prompts. An example follows below:

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
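
With the transformers tokenizer, the same template can be applied programmatically. A minimal sketch, assuming the repository's tokenizer ships a ChatML chat template:

```python
# Minimal sketch: building a ChatML prompt via the tokenizer's chat template.
# Assumes the repo's tokenizer_config.json defines a ChatML template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FelixChao/WestSeverus-7B-DPO-v2")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve 12 * 17 step by step."},
]

# add_generation_prompt=True appends the trailing "<|im_start|>assistant"
# turn so the model continues as the assistant.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```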

πŸ› οΈ Quantized Models

Another version of WestSeverus Model:

MaziyarPanahi/WestSeverus-7B-DPO-v2-GGUF
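
The GGUF files can be run with llama-cpp-python, which ships a built-in chatml preset matching the prompt format above. A minimal sketch; the filename glob below is a hypothetical pattern, so check the repo for the actual quant file names:

```python
# Minimal sketch: running a GGUF quant of WestSeverus-7B-DPO-v2 with
# llama-cpp-python. The filename glob is hypothetical; pick a real quant
# file from the MaziyarPanahi/WestSeverus-7B-DPO-v2-GGUF repo.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="MaziyarPanahi/WestSeverus-7B-DPO-v2-GGUF",
    filename="*Q4_K_M.gguf",  # hypothetical pattern; verify against the repo
    chat_format="chatml",     # matches the ChatML prompt format above
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the molar mass of water?"},
    ],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```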

πŸ™ Gratitude

  • Thanks to @senseable for senseable/WestLake-7B-v2.
  • Thanks to @jondurbin for jondurbin/truthy-dpo-v0.1 dataset.
  • Thanks to @Charles Goddard for MergeKit.
  • Thanks to @TheBloke, @s3nh, @MaziyarPanahi for Quantized Models.
  • Thanks to @mlabonne, @CultriX for YALL - Yet Another LLM Leaderboard.
  • Thank you to everyone else in the open-source AI community who has used this model for further research and improvement.
