---
license: apache-2.0
datasets:
- amphora/QwQ-LongCoT-130K-2
- PowerInfer/QWQ-LONGCOT-500K
- PowerInfer/LONGCOT-Refine-500K
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
metrics:
- perplexity
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
library_name: transformers
---
## Model Details:

- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- **Teacher Model:** Qwen/QwQ-32B-Preview
- **Distillation Framework:** Instruction Tuning
- **Task Type:** Conversational AI / Causal Language Modeling
- **Parameters:** 0.5B
- **Special Features:**
  - Integrated gradient checkpointing for efficient training (see the sketch below)
  - Step-by-step reasoning capabilities for better problem-solving

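For reference, the snippet below is a minimal, illustrative sketch (not taken from this repo) of how gradient checkpointing is typically switched on for a `transformers` model; the training script further down simply passes the `--gradient_checkpointing` flag through `SFTConfig` instead.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative only: enable activation (gradient) checkpointing to trade
# extra compute in the backward pass for a smaller activation-memory footprint.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.bfloat16,
)
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is not used while checkpointing during training
```
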
---

## Training:

QwQ-0.5B-Distilled was trained on the **QwQ-LongCoT-130K-2 dataset**, a carefully curated collection of long-context examples designed for reasoning and conversational AI tasks. Distillation here is done as supervised instruction tuning: the student model is trained to reproduce the teacher model's responses, aligning its predictions with high-quality outputs.

### Training Progress:
[▓▓▓▓▓▓▓▓▓▓] 100%

### Training Script:

```python
import argparse

import torch
from datasets import Dataset, load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer

parser = argparse.ArgumentParser()
parser.add_argument("--max_length", type=int, default=4096)
parser.add_argument("--output_dir", type=str, default="gkd-model")
parser.add_argument("--per_device_train_batch_size", type=int, default=1)
parser.add_argument("--gradient_accumulation_steps", type=int, default=16)
parser.add_argument("--gradient_checkpointing", action="store_true", default=False)
parser.add_argument("--resume_from_checkpoint", action="store_true", default=False)
parser.add_argument("--lora", action="store_true")
args = parser.parse_args()

# Build chat-formatted training examples from the long chain-of-thought dataset.
qwq_dataset = load_dataset("amphora/QwQ-LongCoT-130K-2", split="train")
messages = []
for each in qwq_dataset:
    msg = [
        {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
        {"role": "user", "content": each["problem"]},
        {"role": "assistant", "content": each["qwq"]},
    ]
    messages.append(msg)

# 90% / 10% train/eval split.
TRAIN_SPLIT_RATIO = 0.9
train_size = int(TRAIN_SPLIT_RATIO * len(messages))
eval_size = len(messages) - train_size

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# The student model to optimise.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

train_dataset = Dataset.from_dict({"messages": messages[:train_size]})
eval_dataset = Dataset.from_dict({"messages": messages[train_size:]})

training_args = SFTConfig(
    output_dir=args.output_dir,
    max_seq_length=args.max_length,
    per_device_train_batch_size=args.per_device_train_batch_size,
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    gradient_checkpointing=args.gradient_checkpointing,
    save_steps=100,
    save_total_limit=5,
)

# Optional LoRA configuration, only used when --lora is passed.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Mask the loss on everything before the assistant turn, so only the
# teacher-style completion contributes to training.
response_template = "<|im_start|>assistant\n"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config if args.lora else None,
    data_collator=collator,
)
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
```
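When the script is run with `--lora`, the checkpoints under `--output_dir` contain LoRA adapter weights rather than a full model. The snippet below is a hedged sketch (the paths are placeholders, not from this repo) of how such adapters are typically merged back into the base model with `peft` before uploading or serving:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: point these at your own adapter checkpoint and output folder.
adapter_dir = "gkd-model/checkpoint-100"
merged_dir = "qwq-0.5b-distilled-merged"

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
)
# Attach the trained LoRA adapter, then fold its weights into the base model.
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()

merged.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct").save_pretrained(merged_dir)
```
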
### Dataset:
- **Source:** `amphora/QwQ-LongCoT-130K-2`
- **Split:** 90% Training, 10% Evaluation

---

## Example Usage:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name
model_name = "kz919/QwQ-0.5B-Distilled-SFT"

# Load the model
print(f"Starting to load the model {model_name} into memory")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define the prompt
prompt = "How many r in strawberry."
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt},
]

# Tokenize the input
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

# Decode the response
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
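Because the model is tuned to think step-by-step, responses can run long; streaming tokens as they are generated makes this easier to follow. The optional sketch below reuses `model`, `tokenizer`, and `model_inputs` from the example above and adds `transformers`' `TextStreamer`:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

_ = model.generate(
    **model_inputs,
    max_new_tokens=4096,
    streamer=streamer,
)
```
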

---

## Applications:

1. **Conversational Assistants:**
   Suitable for AI chatbots that require reasoning and long-context understanding.

2. **Educational Tools:**
   Provides step-by-step explanations, making it ideal for learning environments.

3. **Creative Writing:**
   Assists in generating coherent, contextually aware long-form content.

4. **Technical Support:**
   Handles complex customer queries with precision and clarity.

---

## Limitations:

- While distilled for efficiency, performance on highly complex reasoning tasks may slightly trail the teacher model.
- This model could still be undertrained and is merely a proof of concept, so don't yell at me if it outputs nonsense.

---

## Citation:

If you use this model in your research or applications, please cite it as:

```bibtex
@misc{qwq_0.5B_distilled,
  author    = {Kaizhao Liang},
  title     = {Mini-QwQ: A Reasoning Model for Edge Devices},
  year      = {2024},
  publisher = {Hugging Face},
  version   = {1.0}
}
```

---

This model is an example of how efficient fine-tuning and distillation methods can deliver robust conversational AI capabilities in a smaller, more manageable footprint.