---
language:
- en
- zh
license: apache-2.0
tags:
- axolotl
- generated_from_trainer
base_model: Qwen/Qwen2-0.5B
datasets:
- Magpie-Align/Magpie-Qwen2-Pro-300K-Filtered
model-index:
- name: Qwen2-0.5B-Abyme
  results: []
---


[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.4.1`
```yaml
adapter: null
base_model: Qwen/Qwen2-0.5B
bf16: auto
chat_template: chatml
dataset_prepared_path: ./data/last_run_prepared
datasets:
- path: Magpie-Align/Magpie-Qwen2-Pro-300K-Filtered
  type: sharegpt
deepspeed: null
early_stopping_patience: null
eval_sample_packing: true
evals_per_epoch: 4
flash_attention: true
fp16: null
fsdp: null
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
group_by_length: false
hf_use_auth_token: true
hub_model_id: CoolSpring/Qwen2-0.5B-Abyme
learning_rate: 2e-5
load_in_4bit: false
load_in_8bit: false
local_rank: null
logging_steps: 1
lr_scheduler: cosine
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_torch
output_dir: ./outputs/out
pad_to_sequence_len: true
resize_token_embeddings_to_32x: true
resume_from_checkpoint: null
sample_packing: true
saves_per_epoch: 1
sequence_len: 4096
tf32: true
tokens:
- <|im_start|>
- <|im_end|>
train_on_inputs: false
val_set_size: 0.05
wandb_entity: null
wandb_log_model: null
wandb_name: Qwen2-0.5B-Abyme
wandb_project: Qwen2-0.5B-Magpie-Qwen2-Pro-300K-Filtered
wandb_watch: null
warmup_steps: 100
weight_decay: null
xformers_attention: null

```

</details><br>

# Qwen2-0.5B-Abyme

This model is a fine-tuned version of [Qwen/Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B) on the [Magpie-Align/Magpie-Qwen2-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-300K-Filtered) dataset. It was created to explore the effects of training the smallest model in the Qwen2 series on data extracted from the largest model in the Qwen2 series (as of July 18th, 2024).

It achieves the following results on the evaluation set:
- Loss: 0.8229  
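
For readers who prefer perplexity, the loss above can be converted with a one-line calculation. This is a minimal sketch that assumes the reported value is the mean per-token cross-entropy in nats, which is the usual Trainer convention but is not stated explicitly in this card.

```python
import math

# Evaluation loss reported above. Assumption: mean per-token
# cross-entropy measured in nats.
eval_loss = 0.8229

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)
print(f"perplexity ~ {perplexity:.2f}")
```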

## Model description

Qwen2-0.5B-Abyme is a 0.5-billion-parameter language model fine-tuned on conversation samples generated by the much larger 72-billion-parameter Qwen2-72B model. The experiment investigates whether a small model can learn to reproduce the knowledge and capabilities of a significantly larger one through fine-tuning.

## Intended uses & limitations

This model is intended for research on knowledge transfer and distillation in language models. It may also be useful where compute is too limited to run large language models, although whether a small fine-tuned model can approach their performance on a given task has not been established here.

However, the model's capabilities and limitations have not yet been fully evaluated. Its performance may vary with task and domain, and it may exhibit biases or limitations inherited from the base model and from the model that generated the training data.

## Training and evaluation data

The model was fine-tuned on the [Magpie-Align/Magpie-Qwen2-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-300K-Filtered) dataset, which contains 300,000 conversation samples from the Qwen2-72B model. 5% of this dataset was held out as the evaluation set for calculating the reported loss metric.
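
The axolotl config lists the dataset as `type: sharegpt` with `chat_template: chatml` and `val_set_size: 0.05`. The sketch below illustrates what that combination implies: the approximate split sizes, and how one ShareGPT-style conversation would look once rendered as ChatML. The helper `sharegpt_to_chatml` and the sample conversation are hypothetical and written for illustration only; this is not axolotl's own preprocessing code.

```python
# Split sizes, taking the 300,000 figure above at face value.
# Integer arithmetic avoids floating-point rounding.
total_samples = 300_000
eval_size = total_samples * 5 // 100      # 5% held out
train_size = total_samples - eval_size

# ShareGPT speaker names mapped to ChatML roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_chatml(conversations, add_generation_prompt=False):
    """Render a list of {"from": ..., "value": ...} turns as ChatML text.

    Hypothetical helper for illustration; not part of axolotl.
    """
    parts = []
    for turn in conversations:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Made-up sample conversation, not taken from the dataset.
sample = [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "2 + 2 equals 4."},
]

print(train_size, eval_size)
print(sharegpt_to_chatml(sample))
```

Because the config sets `train_on_inputs: false`, the loss would be expected to cover only the assistant turns, not the user prompts.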

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: AdamW (`adamw_torch`) with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 1
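
The relationship between these values can be sketched as follows. The block recomputes the total batch size, gives an upper bound on tokens per optimizer step, and evaluates a cosine schedule with linear warmup. Two values are assumptions rather than facts from this card: `num_devices = 1` (inferred because 4 × 4 already equals 16) and `total_steps = 2387` (estimated from the step and epoch columns of the training results). The function `lr_at` is an illustrative reimplementation of the usual warmup-plus-cosine shape, not the scheduler object used in training.

```python
import math

micro_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1        # assumption: implied by 4 * 4 = 16, not stated in the card
sequence_len = 4096    # from the axolotl config

total_train_batch_size = (
    micro_batch_size * gradient_accumulation_steps * num_devices
)
# Upper bound: with sample packing and padding to sequence_len,
# each sequence holds at most sequence_len tokens.
max_tokens_per_step = total_train_batch_size * sequence_len

peak_lr = 2e-5
warmup_steps = 100
total_steps = 2387     # assumption: estimated from the training results table


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


print(total_train_batch_size, max_tokens_per_step)
for step in (0, 50, 100, 1194, total_steps):
    print(step, f"{lr_at(step):.2e}")
```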

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.9947        | 0.0004 | 1    | 0.9683          |
| 0.8385        | 0.2501 | 597  | 0.8338          |
| 0.7636        | 0.5002 | 1194 | 0.8249          |
| 0.8124        | 0.7502 | 1791 | 0.8229          |
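
The table can be used to back out the approximate number of optimizer steps per epoch and to check the evaluation cadence. The sketch below uses only the rows above; the first row (step 1) is left out because its epoch value is rounded too coarsely to give a reliable ratio.

```python
# (epoch, step, validation_loss) rows copied from the table above.
rows = [
    (0.2501, 597, 0.8338),
    (0.5002, 1194, 0.8249),
    (0.7502, 1791, 0.8229),
]

# Each row gives an independent estimate of steps per epoch.
estimates = [step / epoch for epoch, step, _ in rows]
steps_per_epoch = round(sum(estimates) / len(estimates))

# Steps between consecutive evaluations.
eval_interval = rows[1][1] - rows[0][1]

# Validation-loss improvement between consecutive evaluations.
improvements = [
    round(prev[2] - curr[2], 4) for prev, curr in zip(rows, rows[1:])
]

print(steps_per_epoch, eval_interval, improvements)
```

An interval of 597 steps is roughly a quarter of the estimated epoch length, which is consistent with `evals_per_epoch: 4` in the config. The shrinking improvements suggest the validation loss was flattening by the last logged evaluation.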


### Framework versions

- Transformers 4.42.3
- Pytorch 2.3.1+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_CoolSpring__Qwen2-0.5B-Abyme).

|      Metric       |Value|
|-------------------|----:|
|Avg.               | 4.76|
|IFEval (0-Shot)    |19.15|
|BBH (3-Shot)       | 2.28|
|MATH Lvl 5 (4-Shot)| 1.51|
|GPQA (0-shot)      | 0.45|
|MuSR (0-shot)      | 1.48|
|MMLU-PRO (5-shot)  | 3.70|
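
As a quick consistency check, the `Avg.` row can be reproduced from the six benchmark rows. The sketch assumes the average is an unweighted mean, which matches the reported figure.

```python
# Scores copied from the table above.
scores = {
    "IFEval (0-Shot)": 19.15,
    "BBH (3-Shot)": 2.28,
    "MATH Lvl 5 (4-Shot)": 1.51,
    "GPQA (0-shot)": 0.45,
    "MuSR (0-shot)": 1.48,
    "MMLU-PRO (5-shot)": 3.70,
}

# Assumption: "Avg." is the unweighted mean of the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")
```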