base_model: microsoft/Phi-3-mini-4k-instruct
inference: false
license: mit
license_link: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/resolve/main/LICENSE
language:
  - en
pipeline_tag: text-generation
tags:
  - nlp
  - code
model_creator: microsoft
model_name: Phi-3-mini-4k-instruct
model_type: phi3
quantized_by: brittlewis12

Phi 3 Mini 4K Instruct GGUF

Updated with Microsoft’s latest model changes as of July 21, 2024

Original model: Phi-3-mini-4k-instruct

Model creator: Microsoft

This repo contains GGUF format model files for Microsoft’s Phi 3 Mini 4K Instruct.

Phi-3-Mini-4K-Instruct is a lightweight, state-of-the-art open model with 3.8B parameters. It was trained on the Phi-3 datasets, which include both synthetic data and filtered, publicly available website data, with a focus on high-quality, reasoning-dense content.

Learn more on Microsoft’s model page.

What is GGUF?

GGUF is a file format for representing AI models. It is the third version of the format, introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Converted with llama.cpp build 3432 (revision 45f2c19), using autogguf.
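To make the format a little more concrete, here is a minimal sketch of reading the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts). The function name `read_gguf_header` and the counts in the usage lines are illustrative only and are not taken from this repo's files.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first 24 bytes of a file.

    Assumes the little-endian layout of GGUF v2 and later.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("missing GGUF magic; not a GGUF file")
    # "<" = little-endian, no padding; I = uint32, Q = uint64
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Usage with a synthetic header (the counts here are made up for illustration).
example = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 195, 24)
print(read_gguf_header(example))
```

For a real file, pass the first 24 bytes, for example `open(path, "rb").read(24)`, where `path` points at whichever `.gguf` file you downloaded.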

Prompt template

<|system|>
{{system_prompt}}<|end|>
<|user|>
{{prompt}}<|end|>
<|assistant|>
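
If you are assembling prompts by hand, the template above can be filled in with a small helper. This is a sketch, not an official API: the function name `format_phi3_prompt` is hypothetical, and two details are assumptions on my part, namely that the system block is omitted when no system prompt is given and that the prompt ends with a newline after `<|assistant|>`.

```python
from typing import Optional


def format_phi3_prompt(prompt: str, system_prompt: Optional[str] = None) -> str:
    """Fill in the Phi-3 prompt template shown above.

    Assumptions: the system block is skipped when system_prompt is empty,
    and a trailing newline follows the final <|assistant|> tag.
    """
    parts = []
    if system_prompt:
        parts.append(f"<|system|>\n{system_prompt}<|end|>\n")
    parts.append(f"<|user|>\n{prompt}<|end|>\n")
    parts.append("<|assistant|>\n")
    return "".join(parts)


# Usage
print(format_phi3_prompt("What is GGUF?", system_prompt="You are a concise assistant."))
```

When generating, it is usually sensible to treat `<|end|>` as a stop sequence so the model does not run on into a new turn.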

Download & run with cnvrs on iPhone, iPad, and Mac!

cnvrs.ai

cnvrs is the best app for private, local AI on your device:

  • create & save Characters with custom system prompts & temperature settings
  • download and experiment with any GGUF model you can find on Hugging Face!
  • make it your own with custom Theme colors
  • powered by Metal ⚡️ & llama.cpp, with haptics during response streaming!
  • try it out yourself today, on TestFlight!
  • follow cnvrs on Twitter to stay up to date

Original Model Evaluation

Comparison of July update vs original April release:

| Benchmarks | Original | June 2024 Update |
| :- | :- | :- |
| Instruction Extra Hard | 5.7 | 6.0 |
| Instruction Hard | 4.9 | 5.1 |
| Instructions Challenge | 24.6 | 42.3 |
| JSON Structure Output | 11.5 | 52.3 |
| XML Structure Output | 14.4 | 49.8 |
| GPQA | 23.7 | 30.6 |
| MMLU | 68.8 | 70.9 |
| **Average** | **21.9** | **36.7** |
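
The Average row appears to be the unweighted mean of the seven benchmarks above it. That is an inference from the numbers rather than something the table states, and the short sketch below checks it. The dictionary just restates the table values; the name `column_averages` is illustrative.

```python
# Scores copied from the comparison table: (original, update)
SCORES = {
    "Instruction Extra Hard": (5.7, 6.0),
    "Instruction Hard": (4.9, 5.1),
    "Instructions Challenge": (24.6, 42.3),
    "JSON Structure Output": (11.5, 52.3),
    "XML Structure Output": (14.4, 49.8),
    "GPQA": (23.7, 30.6),
    "MMLU": (68.8, 70.9),
}


def column_averages(scores: dict) -> tuple:
    """Return the unweighted mean of each column, rounded to one decimal."""
    n = len(scores)
    original = sum(orig for orig, _ in scores.values()) / n
    update = sum(upd for _, upd in scores.values()) / n
    return round(original, 1), round(update, 1)


print(column_averages(SCORES))  # → (21.9, 36.7)
```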

Original April release

As is now standard, we use few-shot prompts to evaluate the models, at temperature 0. The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for Phi-3. More specifically, we do not change prompts, pick different few-shot examples, change prompt format, or do any other form of optimization for the model.

The number of k-shot examples is listed per benchmark.
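
As a rough illustration of what a k-shot prompt looks like, here is a minimal sketch. Microsoft's evaluation tool is internal, so everything here is an assumption: the `Question:`/`Answer:` layout, the function name `build_few_shot_prompt`, and the sample data are hypothetical and are not the prompts used to produce the scores below.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str, k: int) -> str:
    """Prepend the first k solved examples to an unanswered question.

    The "Question:/Answer:" layout is a hypothetical format for illustration.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Usage: a 2-shot prompt built from made-up examples.
solved = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8"), ("7 + 1 = ?", "8")]
print(build_few_shot_prompt(solved, "6 + 3 = ?", k=2))
```

With `k=0` the same helper produces a zero-shot prompt containing only the open question. Decoding at temperature 0, as described above, makes the output deterministic for a given prompt.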

| Benchmark | Phi-3-Mini-4K-In (3.8b) | Phi-2 (2.7b) | Mistral (7b) | Gemma (7b) | Llama-3-In (8b) | Mixtral (8x7b) | GPT-3.5 (version 1106) |
| :- | :- | :- | :- | :- | :- | :- | :- |
| MMLU (5-Shot) | 68.8 | 56.3 | 61.7 | 63.6 | 66.5 | 68.4 | 71.4 |
| HellaSwag (5-Shot) | 76.7 | 53.6 | 58.5 | 49.8 | 71.1 | 70.4 | 78.8 |
| ANLI (7-Shot) | 52.8 | 42.5 | 47.1 | 48.7 | 57.3 | 55.2 | 58.1 |
| GSM-8K (0-Shot; CoT) | 82.5 | 61.1 | 46.4 | 59.8 | 77.4 | 64.7 | 78.1 |
| MedQA (2-Shot) | 53.8 | 40.9 | 49.6 | 50.0 | 60.5 | 62.2 | 63.4 |
| AGIEval (0-Shot) | 37.5 | 29.8 | 35.1 | 42.1 | 42.0 | 45.2 | 48.4 |
| TriviaQA (5-Shot) | 64.0 | 45.2 | 72.3 | 75.2 | 67.7 | 82.2 | 85.8 |
| Arc-C (10-Shot) | 84.9 | 75.9 | 78.6 | 78.3 | 82.8 | 87.3 | 87.4 |
| Arc-E (10-Shot) | 94.6 | 88.5 | 90.6 | 91.4 | 93.4 | 95.6 | 96.3 |
| PIQA (5-Shot) | 84.2 | 60.2 | 77.7 | 78.1 | 75.7 | 86.0 | 86.6 |
| SociQA (5-Shot) | 76.6 | 68.3 | 74.6 | 65.5 | 73.9 | 75.9 | 68.3 |
| BigBench-Hard (0-Shot) | 71.7 | 59.4 | 57.3 | 59.6 | 51.5 | 69.7 | 68.32 |
| WinoGrande (5-Shot) | 70.8 | 54.7 | 54.2 | 55.6 | 65 | 62.0 | 68.8 |
| OpenBookQA (10-Shot) | 83.2 | 73.6 | 79.8 | 78.6 | 82.6 | 85.8 | 86.0 |
| BoolQ (0-Shot) | 77.6 | -- | 72.2 | 66.0 | 80.9 | 77.6 | 79.1 |
| CommonSenseQA (10-Shot) | 80.2 | 69.3 | 72.6 | 76.2 | 79 | 78.1 | 79.6 |
| TruthfulQA (10-Shot) | 65.0 | -- | 52.1 | 53.0 | 63.2 | 60.1 | 85.8 |
| HumanEval (0-Shot) | 59.1 | 47.0 | 28.0 | 34.1 | 60.4 | 37.8 | 62.2 |
| MBPP (3-Shot) | 53.8 | 60.6 | 50.8 | 51.5 | 67.7 | 60.2 | 77.8 |