---
tags:
- unsloth
license: llama3.1
library_name: transformers
base_model:
- deepcogito/cogito-v2-preview-llama-405B
pipeline_tag: text-generation
---
> [!NOTE]
> Includes Unsloth **chat template fixes**! <br> For `llama.cpp`, use `--jinja`.
>

<div>
  <p style="margin-top: 0;margin-bottom: 0;">
    <em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy & outperforms other leading quants.</em>
  </p>
  <div style="display: flex; gap: 5px; align-items: center; ">
    <a href="https://github.com/unslothai/unsloth/">
      <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
    </a>
    <a href="https://discord.gg/unsloth">
      <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
    </a>
    <a href="https://docs.unsloth.ai/">
      <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
    </a>
  </div>
</div>

<p align="center">
  <img src="images/deep-cogito-logo.png" alt="Logo" width="40%">
</p>

# Cogito v2 preview - 405B

[Blog Post](https://www.deepcogito.com/research/cogito-v2-preview)

The Cogito v2 LLMs are instruction-tuned generative models. All models are released under an open license for commercial use.

- Cogito v2 models are hybrid reasoning models. Each model can answer directly (like a standard LLM) or self-reflect before answering (like a reasoning model).
- The LLMs are trained using **Iterated Distillation and Amplification (IDA)** - a scalable and efficient alignment strategy for superintelligence based on iterative self-improvement.
- The models have been optimized for coding, STEM, instruction following and general helpfulness, and have significantly stronger multilingual, coding and tool-calling capabilities than size-equivalent counterparts.
- In both standard and reasoning modes, Cogito v2-preview models outperform their size-equivalent counterparts on common industry benchmarks.
- This model is trained in over 30 languages and supports a context length of 128k tokens.

# Evaluations
Here is the model's performance on some standard industry benchmarks:

<p align="left">
  <img src="images/cogito-v2-405b-benchmarks.png" alt="Benchmark results" width="90%">
</p>

For detailed evaluations, please refer to the [Blog Post](https://www.deepcogito.com/research/cogito-v2-preview).

# Usage
Below is a snippet for usage with Transformers:

```python
import transformers
import torch

model_id = "deepcogito/cogito-v2-preview-llama-405B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Give me a short introduction to LLMs."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```

## Implementing extended thinking
- By default, the model answers in standard mode.
- To enable thinking, use either of the following two methods:
  - Set `enable_thinking=True` while applying the chat template.
  - Add a specific system prompt, along with prefilling the response with "\<think\>\n".

**NOTE: Unlike Cogito v1 models, we initiate the response with "\<think\>\n" at the beginning of every output when reasoning is enabled. This is because hybrid models can be brittle at times (<0.1% of cases), and adding "\<think\>\n" ensures that the model does indeed respect thinking.**

### Method 1 - Set enable_thinking=True in the tokenizer
If you are using Hugging Face tokenizers, you can simply add the argument `enable_thinking=True` to the tokenization call (this option is added to the chat template).

Here is an example -
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepcogito/cogito-v2-preview-llama-405B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to LLMs."
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
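
When thinking is enabled, the decoded response contains the model's reasoning followed by its final answer. Below is a minimal sketch for separating the two; it assumes the reasoning block is closed with a `</think>` tag (an assumption; this card only shows the opening "\<think\>\n"), and the `split_thinking` helper is my own, not part of the model card:

```python
# Hypothetical helper (not part of the model card): separate the reasoning from the
# final answer. Assumes the reasoning block ends with a "</think>" tag; the opening
# "<think>" may already live in the prompt when it is prefilled (Method 2 below).
def split_thinking(response: str):
    close_tag = "</think>"
    if close_tag in response:
        thinking, _, answer = response.partition(close_tag)
        thinking = thinking.replace("<think>", "").strip()
        return thinking, answer.strip()
    # No thinking block found - treat the whole response as the answer.
    return None, response.strip()

thinking, answer = split_thinking(response)
print(answer)
```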

### Method 2 - Add a specific system prompt, along with prefilling the response with "\<think\>\n"
To enable thinking using this method, you need to do two things -

Step 1 - Simply use this as the system prompt: `system_instruction = 'Enable deep thinking subroutine.'`

If you already have a system_instruction, then use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`.

Step 2 - Prefill the response with the tokens `"<think>\n"`.

Here is an example -

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepcogito/cogito-v2-preview-llama-405B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Step 1 - Add deep thinking instruction.
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION},
    {"role": "user", "content": "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Step 2 - Prefill response with "<think>\n".
text += "<think>\n"

# Now, continue as usual.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

Similarly, if you already have a system prompt, you can prepend `DEEP_THINKING_INSTRUCTION` to it like this -

```python
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

system_prompt = "Reply to each prompt with only the actual code - no explanations."
prompt = "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION + '\n\n' + system_prompt},
    {"role": "user", "content": prompt}
]
```

# Tool Calling
Cogito models support tool calling (single, parallel, multiple and parallel_multiple), in both standard and extended thinking mode.

Here is a snippet -

```python
# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
messages = [
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)
```

This will result in the output -
```
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
</tool_call><|eot_id|>
```

If the model generates a tool call like this, you should add it to the chat like so:

```python
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```
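
In practice you will usually want to parse the tool call out of the generated text rather than hard-coding it as above. Here is a minimal sketch based on the `<tool_call>` output format shown earlier; the `parse_tool_calls` helper is my own, not part of the model card:

```python
import json
import re

def parse_tool_calls(generated_text: str):
    """Extract tool-call dicts from <tool_call>...</tool_call> blocks (hypothetical helper)."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(block) for block in pattern.findall(generated_text)]

# Same messages update as the hard-coded snippet above, but driven by the model output.
for tool_call in parse_tool_calls(output_text):
    messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```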

and then call the tool and append the result, with the `tool` role, like so:

```python
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```

After that, you can `generate()` again to let the model use the tool result in the chat:

```python
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```

This should result in the string -
```
'The current temperature in Paris is 22.0 degrees.<|eot_id|>'
```

## License
This repository and the model weights are licensed under the [Llama 3.1 Community License Agreement](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE) (the default license agreement for Llama models).

## Contact
If you would like to reach out to our team, send an email to [email protected].