---
language:
  - en
license: apache-2.0
model-index:
  - name: WestSeverus-7B-DPO-v2
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 71.42
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 88.27
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 64.79
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 72.37
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 83.27
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 71.65
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FelixChao/WestSeverus-7B-DPO-v2
          name: Open LLM Leaderboard
---

WestSeverus-7B-DPO-v2


☘️ Model Description

WestSeverus-7B-DPO-v2 is a WestLake-family model trained on top of WestSeverus-7B.

The model was trained on several DPO datasets and performs well on basic math problems.

WestSeverus-7B-DPO-v2 can be used for mathematics, chemistry, physics, and even coding for further research and reference.
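
As a quick start, the model can be loaded with the Hugging Face transformers library. This is a minimal sketch: the repo id FelixChao/WestSeverus-7B-DPO-v2 is taken from the leaderboard links in this card, and the dtype and device settings are illustrative, not requirements.

```python
# Minimal sketch: loading WestSeverus-7B-DPO-v2 and generating a completion.
# The dtype/device settings are illustrative; adjust to your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FelixChao/WestSeverus-7B-DPO-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("What is 15% of 240?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For chat-style use, apply the ChatML template described in the Prompt Format section below before generating.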

📖 Table of Contents

  1. Nous Benchmark Results
    • AGIEval
    • GPT4All
    • TruthfulQA Scores
    • BigBench
  2. Open LLM Leaderboard
    • ARC
    • HellaSwag
    • MMLU
    • TruthfulQA
    • Winogrande
    • GSM8K
  3. EvalPlus Leaderboard
    • HumanEval
    • HumanEval_Plus
    • MBPP
    • MBPP_Plus
  4. Prompt Format
  5. Quantized Models
  6. Gratitude

🪄 Nous Benchmark Results

WestSeverus-7B-DPO-v2 currently sits at the top of the YALL - Yet Another LLM Leaderboard created by CultriX, and posts the strongest TruthfulQA and BigBench scores among the models listed below.

| Model | Average | AGIEval | GPT4All | TruthfulQA | BigBench |
|---|---|---|---|---|---|
| WestSeverus-7B-DPO-v2 | 60.98 | 45.29 | 77.2 | 72.72 | 48.71 |
| CultriX/Wernicke-7B-v1 | 60.73 | 45.59 | 77.36 | 71.46 | 48.49 |
| mlabonne/NeuralBeagle14-7B | 60.25 | 46.06 | 76.77 | 70.32 | 47.86 |
| CultriX/MistralTrix-v1 | 60.05 | 44.98 | 76.62 | 71.44 | 47.17 |
| senseable/WestLake-7B-v2 | 59.42 | 44.27 | 77.86 | 67.46 | 48.09 |
| mlabonne/Daredevil-7B | 58.22 | 44.85 | 76.07 | 64.89 | 47.07 |
| microsoft/phi-2 | 44.61 | 27.96 | 70.84 | 44.46 | 35.17 |

πŸ† Open LLM Leaderboard

WestSeverus-7B-DPO-v2 is one of the top 7B model in Open LLM Leaderboard and it outperforms on TruthfulQA and GSM8K.

| Metric | Value |
|---|---|
| Avg. | 75.29 |
| AI2 Reasoning Challenge (25-Shot) | 71.42 |
| HellaSwag (10-Shot) | 88.27 |
| MMLU (5-Shot) | 64.79 |
| TruthfulQA (0-shot) | 72.37 |
| Winogrande (5-shot) | 83.27 |
| GSM8k (5-shot) | 71.65 |

Detailed results can be found here
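
These scores come from the leaderboard's lm-evaluation-harness runs. A rough local reproduction of a single task might look like the sketch below; the simple_evaluate call and task name reflect recent harness versions (an assumption), and local numbers may not match the leaderboard exactly across harness versions and configurations.

```python
# Sketch: re-running one leaderboard task (GSM8K, 5-shot) locally with
# EleutherAI's lm-evaluation-harness. Exact scores may differ from the
# leaderboard depending on harness version and configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=FelixChao/WestSeverus-7B-DPO-v2,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,   # matches the leaderboard's GSM8k (5-shot) setting
    batch_size=8,
)
print(results["results"]["gsm8k"])
```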

⚡ EvalPlus Leaderboard

| Model | HumanEval | HumanEval_Plus | MBPP | MBPP_Plus |
|---|---|---|---|---|
| phi-2-2.7B | 48.2 | 43.3 | 61.9 | 51.4 |
| WestSeverus-7B-DPO-v2 | 43.3 | 34.1 | TBD | TBD |
| SOLAR-10.7B-Instruct-v1.0 | 42.1 | 34.3 | 42.9 | 34.6 |
| CodeLlama-7B | 37.8 | 34.1 | 57.6 | 45.4 |


βš—οΈ Prompt Format

WestSeverus-7B-DPO-v2 was trained using the ChatML prompt templates with system prompts. An example follows below:

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
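
With the transformers tokenizer, the same template can be applied programmatically. A minimal sketch, assuming the repository's tokenizer ships a ChatML chat template:

```python
# Minimal sketch: building a ChatML prompt via the tokenizer's chat template.
# Assumes the repo's tokenizer_config.json defines a ChatML template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FelixChao/WestSeverus-7B-DPO-v2")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve 12 * 17 step by step."},
]

# add_generation_prompt=True appends the trailing "<|im_start|>assistant"
# turn so the model continues as the assistant.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```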

πŸ› οΈ Quantized Models

Another version of WestSeverus Model:

MaziyarPanahi/WestSeverus-7B-DPO-v2-GGUF
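
The GGUF files can be run with llama-cpp-python, which ships a built-in chatml preset matching the prompt format above. A minimal sketch; the filename glob below is a hypothetical pattern, so check the repo for the actual quant file names:

```python
# Minimal sketch: running a GGUF quant of WestSeverus-7B-DPO-v2 with
# llama-cpp-python. The filename glob is hypothetical; pick a real quant
# file from the MaziyarPanahi/WestSeverus-7B-DPO-v2-GGUF repo.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="MaziyarPanahi/WestSeverus-7B-DPO-v2-GGUF",
    filename="*Q4_K_M.gguf",  # hypothetical pattern; verify against the repo
    chat_format="chatml",     # matches the ChatML prompt format above
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the molar mass of water?"},
    ],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```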

πŸ™ Gratitude

  • Thanks to @senseable for senseable/WestLake-7B-v2.
  • Thanks to @jondurbin for jondurbin/truthy-dpo-v0.1 dataset.
  • Thanks to @Charles Goddard for MergeKit.
  • Thanks to @TheBloke, @s3nh, @MaziyarPanahi for Quantized Models.
  • Thanks to @mlabonne, @CultriX for YALL - Yet Another LLM Leaderboard.
  • Thank you to everyone else in the open-source AI community who has used this model for further research and improvement.
