Initial GPTQ model commit
README.md
@@ -50,20 +50,31 @@
```
User: {prompt}<|end_of_turn|>Assistant:
```

## Provided files and GPTQ parameters

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.

Each separate quant is in a different branch. See below for instructions on fetching from different branches.
<details>
  <summary>Explanation of GPTQ parameters</summary>

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but give lower quantisation accuracy. "None" (no grouping) uses the least VRAM of all.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have issues with models that use Act Order plus Group Size.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is the default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The calibration dataset used during quantisation. It can affect quantisation accuracy, and is not the same as the dataset used to train the model.
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this matches the model's sequence length. For some very long sequence models (16K+), a lower sequence length may have to be used. A lower sequence length does not limit the sequence length of the quantised model; it only affects quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.

</details>
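These parameters map directly onto AutoGPTQ's quantisation config. As a hedged sketch (assumed `auto-gptq` API of this era, not this repo's actual quantisation script; values taken from the `main` branch row in the table below):

```python
# How the table's Bits / GS / Act Order / Damp % columns map onto an
# AutoGPTQ quantisation config. Illustrative only.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,            # "Bits" column
    group_size=128,    # "GS" column (-1 corresponds to "None", i.e. no grouping)
    desc_act=False,    # "Act Order" column
    damp_percent=0.1,  # "Damp %" column
)
```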
69 |
+
|
70 |
+
| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama Compat? | By | Desc |
|
71 |
+
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | --------------- | -- | ---- |
|
72 |
+
| main | 4 | 128 | No | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.26 GB | Yes | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
|
73 |
+
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 8.00 GB | Yes | AutoGPTQ | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
|
74 |
+
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.51 GB | Yes | AutoGPTQ | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
|
75 |
+
| gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.26 GB | Yes | AutoGPTQ | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
|
76 |
+
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.36 GB | No | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
|
77 |
+
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.65 GB | No | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
|
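As a quick illustration ahead of the full instructions below, here is a minimal sketch of fetching one of these quants with `huggingface_hub` (the repo ID is an assumption inferred from this model card; any branch name from the table works as `revision`):

```python
# Minimal sketch: download a single quant branch with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ",  # assumed repo ID
    revision="gptq-4bit-32g-actorder_True",  # pick a branch from the table above
)
print(local_dir)  # path to the downloaded model files
```

The returned `local_dir` can then be passed to your GPTQ loader of choice.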
## How to download from branches
@@ -218,13 +229,13 @@ Thank you to all my generous patrons and donaters!
We have used our own [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) to fine-tune Llama2-13B using [OpenChat](https://huggingface.co/openchat) packing and conditional behavior cloning.
This dataset is our attempt to reproduce the dataset generated for Microsoft Research's [Orca Paper](https://arxiv.org/abs/2306.02707).

This second preview release is trained on a curated filtered subset of most of our GPT-4 augmented data.

This release highlights that our dataset and training methods have surpassed performance parity with the Orca paper.
We measured this with BigBench-Hard and AGIEval, using the same methods as the Orca paper, and find **~103%** of original Orca's performance on average.
As well, this is done with <1/10th the compute requirement and using <20% of the dataset size from the original Orca paper.

We have run extensive evaluations internally and expect this model to **place number 1** on both the HuggingFaceH4 Open LLM Leaderboard and the GPT4ALL Leaderboard for 13B models.

"One" of [OpenChat](https://huggingface.co/openchat) has joined our team, and we'd like to provide special thanks for their training of this model!
We have utilized OpenChat conditional behavior cloning and the [MultiPack algorithm](https://github.com/imoneoi/multipack_sampler), which achieves 99.85% bin-packing efficiency on our dataset.
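As a rough illustration of what bin-packing does here, this hedged sketch (our own toy code, not the MultiPack implementation) greedily packs tokenised sample lengths into fixed-size context windows and reports the packing efficiency:

```python
# Toy first-fit-decreasing packing of sample lengths into context windows.
# Illustrative only; see the MultiPack repo linked above for the real sampler.
def pack(lengths, context_len=4096):
    bins = []  # each bin is a list of sample lengths sharing one window
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= context_len:
                b.append(n)
                break
        else:
            bins.append([n])
    used = sum(sum(b) for b in bins)
    return bins, used / (len(bins) * context_len)  # fraction of window filled

bins, efficiency = pack([3000, 1800, 1200, 900, 700, 300])
print(len(bins), f"{efficiency:.1%}")  # 2 96.4%
```

Higher packing efficiency means fewer padding tokens per batch, which is where much of the training-cost saving comes from.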
@@ -253,46 +264,58 @@
Our average performance for BigBench-Hard: 0.488

Average for AGIEval: 0.447

In the Orca paper, they measured their score relative to Vicuna on these evals.
We have done the same, and find our scores average to **~103%** of the total performance shown in the Orca paper, using the same evaluation methods as outlined in the paper.

So we are surpassing Orca performance with <20% of the dataset size and <1/10th the training budget!

As well, we have evaluated using the methodology and tools for the HuggingFace Leaderboard and GPT4ALL Leaderboard, and find that we place #1 on both for all 13B models at release time!
## AGIEval Performance
We present our results in two columns.
The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other column, as the methods differ.

![OpenOrca Preview2 AGIEval Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2AGIEval.png "AGIEval Performance")

## BigBench-Hard Performance
We present our results in two columns.
The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other column, as the methods differ.

![OpenOrca Preview2 BigBench-Hard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2BigBenchHardEval.png "BigBench-Hard Performance")
## HuggingFaceH4 Open LLM Leaderboard Performance
We have run our own tests using parameters matching the [HuggingFaceH4 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) evals.
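For reproducibility, here is a hedged sketch of how such an evaluation can be run with EleutherAI's lm-evaluation-harness (the API usage and few-shot settings are our assumptions based on the Leaderboard's documented setup, not this team's actual harness invocation):

```python
# Assumed usage of lm-evaluation-harness (v0.3-era API) with the Open LLM
# Leaderboard's documented few-shot settings. The MMLU (hendrycksTest)
# subtasks, run at 5-shot, are omitted here for brevity.
from lm_eval import evaluator

MODEL_ARGS = "pretrained=Open-Orca/OpenOrcaxOpenChat-Preview2-13B"

for task, shots in [("arc_challenge", 25), ("hellaswag", 10), ("truthfulqa_mc", 0)]:
    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=MODEL_ARGS,
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, results["results"][task])
```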
We place #1 for all 13B models at release time!

![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HuggingFaceLeaderboard.png "HuggingFace Leaderboard Performance")
## GPT4ALL Leaderboard Performance
We have tested using parameters matching the GPT4ALL Benchmark Suite and report our results and placement vs their official reporting below.

We place #1 for all open models and approach the performance of `text-davinci-003`, a proprietary OpenAI model an order of magnitude larger.

![OpenOrca Preview2 GPT4ALL Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2GPT4ALL_Leaderboard.png "GPT4ALL Performance")
# Dataset

We used a curated, filtered selection of most of the GPT-4 augmented data from our OpenOrca dataset, which aims to reproduce the Orca Research Paper dataset.
Further details of our curation practices will be forthcoming with our full model releases.
# Training

We trained with 8x A100-80G GPUs for 46 hours, completing 5 epochs of full fine tuning on our dataset in one training run.
This contrasts with the 20x A100-80G GPUs for 200 hours used in the Orca paper, for only 3 epochs, and requiring stacked training (which is known to suffer catastrophic forgetting).
Our compute requirement was <1/10th that of the original Orca (368 vs 4,000 GPU-hours).
Commodity cost was ~$600.
@@ -315,6 +338,18 @@
```
tokenize("User: Hello<|end_of_turn|>Assistant: Hi<|end_of_turn|>User: How are you today?<|end_of_turn|>Assistant:")
# Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901, 6324, 32000, 4911, 29901, 1128, 526, 366, 9826, 29973, 32000, 4007, 22137, 29901]
```

For UIs with Prefix and Suffix fields, these will likely work:

Prefix (include a space after the colon):
```
User: 
```

Suffix (space after the colon):
```
<|end_of_turn|>\nAssistant: 
```
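To make the template concrete, here is a hedged sketch of assembling a multi-turn prompt in this format (our own helper, not code from this repo; the example output matches the tokenize example above):

```python
# Assemble an OpenChat-style prompt string from prior (user, assistant)
# turns plus the next user message. Names here are illustrative only.
END_OF_TURN = "<|end_of_turn|>"

def build_prompt(turns, next_user_message):
    parts = []
    for user_msg, assistant_msg in turns:
        parts.append(f"User: {user_msg}{END_OF_TURN}Assistant: {assistant_msg}{END_OF_TURN}")
    parts.append(f"User: {next_user_message}{END_OF_TURN}Assistant:")
    return "".join(parts)

print(build_prompt([("Hello", "Hi")], "How are you today?"))
# User: Hello<|end_of_turn|>Assistant: Hi<|end_of_turn|>User: How are you today?<|end_of_turn|>Assistant:
```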
# Serving