pglo committed on
Commit a8ee144 · verified · 1 Parent(s): 5cd87da

Update README.md

Files changed (1)
  1. README.md +13 -27
README.md CHANGED
@@ -1,13 +1,5 @@
 ---
 license: apache-2.0
-datasets:
-- HuggingFaceTB/smoltalk
-- TIGER-Lab/MathInstruct
-- nvidia/OpenMathInstruct-2
-- argilla/ifeval-like-data
-- allenai/llama-3.1-tulu-3-70b-preference-mixture
-- jondurbin/gutenberg-dpo-v0.1
-- jondurbin/truthy-dpo-v0.1
 base_model:
 - Zyphra/Zamba2-1.2B
 library_name: transformers
@@ -16,17 +8,7 @@ library_name: transformers
 
 # Model Card for Zamba2-1.2B-Instruct-v2
 
-Zamba2-1.2B-Instruct-v2 is derived from the base [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) model through fine-tuning on instruction-following and conversational datasets. Specifically, the fine-tuning process involved:
-
-1. **Supervised Fine-Tuning (SFT)** on the following datasets for 1 epoch:
-   - [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
-   - [TIGER-Lab/MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct)
-   - [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
-   - [argilla/ifeval-like-data](https://huggingface.co/datasets/argilla/ifeval-like-data)
-
-2. **Direct Preference Optimization (DPO)** was conducted in two stages:
-   - **First-stage DPO**: The model was trained on a subset of the [allenai/llama-3.1-tulu-3-70b-preference-mixture](https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-70b-preference-mixture) dataset for 2 epochs.
-   - **Second-stage DPO**: The model underwent an additional epoch of training using [jondurbin/gutenberg-dpo-v0.1](https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1), represented three times, and [jondurbin/truthy-dpo-v0.1](https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1), represented once.
+Zamba2-1.2B-Instruct-v2 is derived from the base [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) model through SFT and DPO training on instruction-following and conversational datasets.
 
 Zamba2-1.2B-Instruct-v2 is a hybrid model composed of state-space ([Mamba2](https://github.com/state-spaces/mamba)) and transformer blocks.
 
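The next hunk opens on `print((tokenizer.decode(outputs[0])))`, the tail of the README's unchanged usage snippet. For orientation, a minimal sketch of that usage follows; it assumes a transformers release with Zamba2 support, and the prompt, dtype, and generation length are illustrative rather than taken from the card.

```python
# Minimal usage sketch (assumes a transformers release with Zamba2 support).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/Zamba2-1.2B-Instruct-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype=torch.bfloat16
)

# Format a single-turn chat with the model's built-in chat template.
messages = [{"role": "user", "content": "Briefly explain what a state-space model is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
```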
@@ -70,22 +52,19 @@ print((tokenizer.decode(outputs[0])))
 
 ## Performance
 
-Zamba2-1.2B-Instruct-v2 achieves leading instruction-following and multi-turn chat performance for a model of its size and matches strong models significantly larger. For instance, Zamba2-1.2B-Instruct-v2 outperforms Gemma2-2B-Instruct, a very strong model over 2x its size.
-
-<center>
-<img src="https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/ceOUHVeJPhBgwTDCsR9Y6.png" width="900"/>
-</center>
+Zamba2-1.2B-Instruct-v2 achieves leading instruction-following performance for a model of its size and surpasses significantly larger models. For instance, Zamba2-1.2B-Instruct-v2 outperforms Gemma2-2B-Instruct, a very strong model over 2x its size.
 
 
 | Model | Size | IFEval | BBH | GPQA | MATH_hard | MMLU_pro | MUSR | Aggregate |
-|-------|------|--------|-----|------|-----------|----------|------|-----------|
+|:-------|:------:|:--------:|:-----:|:------:|:-----------:|:----------:|:------:|:-----------:|
 | Zamba2-1.2B-Instruct-v2 | 1.22B | 66.51 | 15.33 | 1.09 | 3.59 | 12.89 | 1.59 | 16.83 |
+| Zamba2-1.2B-Instruct | 1.22B | 41.76 | 17.49 | 1.73 | 2.75 | 14.69 | 2.44 | 13.48 |
 | Gemma-2-2b-it | 2.51B | 19.76 | 24.42 | 2.58 | 1.04 | 25.80 | 7.16 | 13.46 |
 | SmolLM2-1.7B-Instruct | 1.71B | 53.00 | 18.30 | 3.51 | 4.89 | 20.51 | 4.53 | 17.46 |
 | Qwen-2.5-1.5B-Instruct | 1.54B | 43.74 | 24.72 | 0.80 | 19.11 | 27.23 | 4.45 | 20.01 |
 | Llama-3.2-1B-Instruct | 1.24B | 56.88 | 16.65 | 2.03 | 6.85 | 17.79 | 1.68 | 16.98 |
 
-Moreover, due to its unique hybrid SSM architecture, Zamba2-1.2B-Instruct-v2 achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.
+Due to its unique hybrid SSM architecture, Zamba2-1.2B-Instruct-v2 achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.
 
 <center>
 <img src="https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/tQ-j1krA634EfTU1Lp3E7.png" width="700" alt="Zamba performance">
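The latency and memory claim added above can be sanity-checked with a micro-benchmark along these lines. This is a sketch only: hardware, dtype, prompt, and token counts are assumptions, and results will vary.

```python
# Rough latency/memory micro-benchmark sketch; numbers depend heavily on
# hardware and settings, so treat this as illustrative, not authoritative.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/Zamba2-1.2B-Instruct-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype=torch.bfloat16
)

inputs = tokenizer("Summarize the plot of Moby-Dick.", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s, "
      f"peak GPU memory {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```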
@@ -114,4 +93,11 @@ Zamba2-1.2B utilizes and extends our original Zamba hybrid SSM-attention archite
 </center>
 
 
-A standalone Pytorch implementation of Zamba2-1.2B-Instruct-v2 may be found [here](https://github.com/Zyphra/Zamba2).
+A standalone PyTorch implementation of Zamba2-1.2B may be found [here](https://github.com/Zyphra/Zamba2).
+
+## Training Recipe
+
+Zamba2-1.2B-Instruct-v2 was trained on a mix of publicly available datasets, including instruction-following and chat data. We experimented with various training approaches and found that the best recipe was as follows:
+1) SFT for one epoch on core chat, reasoning, and math datasets such as [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) and [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
+2) DPO for 3 epochs on core alignment datasets, including a subset of [allenai/llama-3.1-tulu-3-70b-preference-mixture](https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-70b-preference-mixture)
+3) DPO on very high-quality preference datasets such as [jondurbin/truthy-dpo-v0.1](https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1) and [jondurbin/gutenberg-dpo-v0.1](https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1)
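The recipe added above maps naturally onto TRL's `SFTTrainer` and `DPOTrainer`. The following is a hedged sketch of the three stages, not Zyphra's actual training code: it assumes a recent `trl` release, uses placeholder hyperparameters apart from the epoch counts, and takes the gutenberg 3x / truthy 1x mixing from the card text this commit removes.

```python
# Hedged sketch of the three-stage recipe; illustrative only, not Zyphra's
# training code. Epoch counts follow the card; everything else is assumed.
from datasets import concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

model_id = "Zyphra/Zamba2-1.2B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 1) SFT for one epoch on chat/reasoning/math data (smoltalk shown here).
sft_data = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="zamba2-sft", num_train_epochs=1),
    train_dataset=sft_data,
    processing_class=tokenizer,
).train()

# 2) DPO for 3 epochs on a preference mixture (the card used only a subset).
stage2 = load_dataset("allenai/llama-3.1-tulu-3-70b-preference-mixture", split="train")

# 3) A final DPO pass; the removed card text says gutenberg-dpo was represented
#    three times and truthy-dpo once. Keep only the shared preference columns
#    so the two datasets can be concatenated.
cols = ["prompt", "chosen", "rejected"]
gutenberg = load_dataset("jondurbin/gutenberg-dpo-v0.1", split="train").select_columns(cols)
truthy = load_dataset("jondurbin/truthy-dpo-v0.1", split="train").select_columns(cols)
stage3 = concatenate_datasets([gutenberg] * 3 + [truthy]).shuffle(seed=0)

for name, dataset, epochs in [("stage2", stage2, 3), ("stage3", stage3, 1)]:
    DPOTrainer(
        model=model,
        args=DPOConfig(output_dir=f"zamba2-dpo-{name}", num_train_epochs=epochs),
        train_dataset=dataset,
        processing_class=tokenizer,
    ).train()
```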