Update README.md
README.md
CHANGED
@@ -1,13 +1,5 @@
---
license: apache-2.0
-datasets:
-- HuggingFaceTB/smoltalk
-- TIGER-Lab/MathInstruct
-- nvidia/OpenMathInstruct-2
-- argilla/ifeval-like-data
-- allenai/llama-3.1-tulu-3-70b-preference-mixture
-- jondurbin/gutenberg-dpo-v0.1
-- jondurbin/truthy-dpo-v0.1
base_model:
- Zyphra/Zamba2-1.2B
library_name: transformers

@@ -16,17 +8,7 @@ library_name: transformers

# Model Card for Zamba2-1.2B-Instruct-v2

-Zamba2-1.2B-Instruct-v2 is derived from the base [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) model through
-
-1. **Supervised Fine-Tuning (SFT)** on the following datasets for 1 epoch:
-- [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
-- [TIGER-Lab/MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct)
-- [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
-- [argilla/ifeval-like-data](https://huggingface.co/datasets/argilla/ifeval-like-data)
-
-2. **Direct Preference Optimization (DPO)** was conducted in two stages:
-- **First-stage DPO**: The model was trained on a subset of the [allenai/llama-3.1-tulu-3-70b-preference-mixture](https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-70b-preference-mixture) dataset for 2 epochs.
-- **Second-stage DPO**: The model underwent an additional epoch of training using [jondurbin/gutenberg-dpo-v0.1](https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1), represented three times, and [jondurbin/truthy-dpo-v0.1](https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1), represented once.

Zamba2-1.2B-Instruct-v2 is a hybrid model composed of state-space ([Mamba2](https://github.com/state-spaces/mamba)) and transformer blocks.

@@ -70,22 +52,19 @@ print((tokenizer.decode(outputs[0])))

## Performance

-Zamba2-1.2B-Instruct-v2 achieves leading instruction-following
-
-<center>
-<img src="https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/ceOUHVeJPhBgwTDCsR9Y6.png" width="900"/>
-</center>


| Model | Size | IFEval | BBH | GPQA | MATH_hard | MMLU_pro | MUSR | Aggregate |
-
| Zamba2-1.2B-Instruct-v2 | 1.22B | 66.505 | 15.3259 | 1.0933 | 3.59 | 12.89 | 1.5917 | 16.8326 |
| Gemma-2-2b-it | 2.51B | 19.76 | 24.42 | 2.58 | 1.04 | 25.80 | 7.16 | 13.46 |
| SmolLM2-1.7B-Instruct | 1.71B | 53.00 | 18.30 | 3.51 | 4.89 | 20.51 | 4.53 | 17.46 |
| Qwen-2.5-1.5B-Instruct | 1.54B | 43.74 | 24.72 | 0.80 | 19.11 | 27.23 | 4.45 | 20.01 |
| Llama-3.2-1B-Instruct | 1.24B | 56.88 | 16.65 | 2.03 | 6.85 | 17.79 | 1.68 | 16.98 |

-

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/tQ-j1krA634EfTU1Lp3E7.png" width="700" alt="Zamba performance">

@@ -114,4 +93,11 @@ Zamba2-1.2B utilizes and extends our original Zamba hybrid SSM-attention architecture
</center>


-A standalone Pytorch implementation of Zamba2-1.2B

README.md (updated sections)

---
license: apache-2.0
base_model:
- Zyphra/Zamba2-1.2B
library_name: transformers

[…]

# Model Card for Zamba2-1.2B-Instruct-v2

Zamba2-1.2B-Instruct-v2 is derived from the base [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) model through SFT and DPO training on instruction-following and conversational datasets.

Zamba2-1.2B-Instruct-v2 is a hybrid model composed of state-space ([Mamba2](https://github.com/state-spaces/mamba)) and transformer blocks.

[…]

## Performance

Zamba2-1.2B-Instruct-v2 achieves leading instruction-following (IFEval) performance for a model of its size and surpasses some models of significantly larger size on the aggregate score below. For instance, it outperforms Gemma-2-2b-it, a strong model more than twice its size (a reproduction sketch follows the table).

| Model | Size | IFEval | BBH | GPQA | MATH_hard | MMLU_pro | MUSR | Aggregate |
|:-------|:------:|:--------:|:-----:|:------:|:-----------:|:----------:|:------:|:-----------:|
| Zamba2-1.2B-Instruct-v2 | 1.22B | 66.505 | 15.3259 | 1.0933 | 3.59 | 12.89 | 1.5917 | 16.8326 |
| Zamba2-1.2B-Instruct | 1.22B | 41.76 | 17.49 | 1.73 | 2.75 | 14.69 | 2.44 | 13.48 |
| Gemma-2-2b-it | 2.51B | 19.76 | 24.42 | 2.58 | 1.04 | 25.80 | 7.16 | 13.46 |
| SmolLM2-1.7B-Instruct | 1.71B | 53.00 | 18.30 | 3.51 | 4.89 | 20.51 | 4.53 | 17.46 |
| Qwen-2.5-1.5B-Instruct | 1.54B | 43.74 | 24.72 | 0.80 | 19.11 | 27.23 | 4.45 | 20.01 |
| Llama-3.2-1B-Instruct | 1.24B | 56.88 | 16.65 | 2.03 | 6.85 | 17.79 | 1.68 | 16.98 |
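
The benchmark columns appear to correspond to the Open LLM Leaderboard v2 task set (IFEval, BBH, GPQA, MATH hard, MMLU-Pro, MUSR). The commit does not say how the scores were produced; the sketch below shows one plausible way to obtain comparable numbers with EleutherAI's `lm-evaluation-harness`, assuming its `leaderboard_*` task groups and the same assumed repo id as above. It is not the authors' evaluation setup, and exact scores depend on harness version and prompt settings.

```python
# Illustrative evaluation sketch with lm-evaluation-harness (pip install lm-eval).
# Task names assume the harness's Open LLM Leaderboard v2 groups; scores will not exactly
# reproduce the table unless the same harness version, prompts, and normalization are used.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Zyphra/Zamba2-1.2B-instruct-v2,dtype=bfloat16",  # assumed repo id
    tasks=[
        "leaderboard_ifeval",
        "leaderboard_bbh",
        "leaderboard_gpqa",
        "leaderboard_math_hard",
        "leaderboard_mmlu_pro",
        "leaderboard_musr",
    ],
    batch_size=8,
)

# Print the numeric metrics reported for each task group.
for task, metrics in results["results"].items():
    print(task, {k: v for k, v in metrics.items() if isinstance(v, float)})
```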

Due to its unique hybrid SSM architecture, Zamba2-1.2B-Instruct-v2 achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.
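
The latency and memory claim is easy to sanity-check directly. Below is a rough probe of decode throughput and peak GPU memory with plain `transformers`; it is illustrative only, the repo id is again an assumption, and absolute numbers depend on hardware, batch size, and sequence length.

```python
# Rough throughput / memory probe (illustrative; not a rigorous benchmark).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/Zamba2-1.2B-instruct-v2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype=torch.bfloat16
)

inputs = tokenizer("Explain the water cycle in three sentences.", return_tensors="pt").to(model.device)

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s, "
      f"peak GPU memory {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```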

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/tQ-j1krA634EfTU1Lp3E7.png" width="700" alt="Zamba performance">

[…]

</center>

A standalone PyTorch implementation of Zamba2-1.2B may be found [here](https://github.com/Zyphra/Zamba2).

## Training Recipe

Zamba2-1.2B-Instruct-v2 was trained on a mix of publicly available datasets spanning instruction-following and chat data. We experimented with various training approaches and found that the best recipe was the following (a rough code sketch of this pipeline appears after the list):

1) SFT for one epoch on core chat, reasoning, and math datasets such as [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) and [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2).
2) DPO for 3 epochs on core alignment datasets, including a subset of [allenai/llama-3.1-tulu-3-70b-preference-mixture](https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-70b-preference-mixture).
3) DPO on very high-quality preference datasets such as [jondurbin/truthy-dpo-v0.1](https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1) and [jondurbin/gutenberg-dpo-v0.1](https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1).
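
The card stops at the recipe list and does not include training code. For orientation only, here is a rough sketch of the three steps using TRL's `SFTTrainer` and `DPOTrainer`; the dataset configs, column handling, hyperparameters, and checkpoint ids are placeholders, not the authors' actual setup.

```python
# Hedged sketch of the three-step recipe with TRL (pip install trl). This is NOT the
# authors' training code: dataset configs, column handling, and hyperparameters are
# placeholders; only the step order and epoch counts follow the list above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

base_id = "Zyphra/Zamba2-1.2B"  # the base model named in this card
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# 1) SFT for one epoch on chat / reasoning / math instruction data.
sft_data = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")  # config name assumed
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="zamba2-sft", num_train_epochs=1),
    train_dataset=sft_data,      # recent TRL applies the chat template to "messages" columns
    processing_class=tokenizer,  # older TRL releases use tokenizer= instead
)
sft_trainer.train()

# 2) DPO for three epochs on a preference mixture (a subset in the actual recipe).
dpo_data = load_dataset("allenai/llama-3.1-tulu-3-70b-preference-mixture", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="zamba2-dpo", num_train_epochs=3, beta=0.1),
    train_dataset=dpo_data,      # assumes prompt/chosen/rejected columns TRL can parse
    processing_class=tokenizer,
)
dpo_trainer.train()

# 3) A further DPO pass over truthy-dpo-v0.1 and gutenberg-dpo-v0.1 would repeat
#    the DPOTrainer step above with those datasets.
```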