Safetensors
qwen3
ehartford committed on
Commit beb3696 · verified · 1 Parent(s): 0127da0

Update README.md

Files changed (1)
  1. README.md +131 -6
README.md CHANGED
@@ -80,18 +80,143 @@ down_proj: [5120, 25600] → [8192, 29568]
 
  ## Usage
 
- This is an intermediate checkpoint. To use the complete 72B model:
 
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
- # Load the complete model instead
  model = AutoModelForCausalLM.from_pretrained(
-     "Qwen3-72B-Embiggened",
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True
  )
  ```
 
  ## Hardware Requirements
 
 
  ## Usage
 
 
+ ### Basic Usage with Thinking Mode
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
+ model_name = "cognitivecomputations/Qwen3-58B-Embiggened"
+
+ # Load the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+
+ # Prepare the model input
+ prompt = "How many r's are in strawberry?"
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+
+ # Apply chat template with thinking mode enabled
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=True  # Enable thinking mode (default)
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Generate response
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=32768,
+     temperature=0.6,  # Recommended for thinking mode
+     top_p=0.95,
+     top_k=20,
+     min_p=0
  )
+
+ # Parse thinking content and final response
+ output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
+
+ try:
+     # Find the </think> token (id 151668)
+     index = len(output_ids) - output_ids[::-1].index(151668)
+ except ValueError:
+     index = 0
+
+ thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
+ content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
+
+ print("Thinking content:", thinking_content)
+ print("Final answer:", content)
+ ```
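+
+ The literal 151668 above is the `</think>` token id in the Qwen3 tokenizer. If you would rather not hardcode it, the id can be looked up at runtime; a small sketch, equivalent to the parsing step above:
+
+ ```python
+ # Look up the </think> id instead of hardcoding 151668
+ think_end_id = tokenizer.convert_tokens_to_ids("</think>")
+ try:
+     index = len(output_ids) - output_ids[::-1].index(think_end_id)
+ except ValueError:
+     index = 0
+ ```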
+
+ ### Non-Thinking Mode (Efficient General Dialogue)
+ ```python
+ # Same setup as above...
+
+ # Apply chat template with thinking mode disabled
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=False  # Disable thinking for efficiency
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Generate with non-thinking parameters
+ outputs = model.generate(
+     **model_inputs,
+     max_new_tokens=2048,
+     temperature=0.7,  # Recommended for non-thinking mode
+     top_p=0.8,
+     top_k=20,
+     min_p=0
+ )
+ ```
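+
+ The block above stops at `generate`; a minimal sketch of decoding only the newly generated tokens (mirroring the thinking-mode example, with no reasoning parsing needed since thinking is disabled):
+
+ ```python
+ # Decode only the tokens produced after the prompt
+ response = tokenizer.decode(
+     outputs[0][len(model_inputs.input_ids[0]):],
+     skip_special_tokens=True
+ )
+ print(response)
+ ```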
+
+ ### Advanced: Dynamic Mode Switching
+ ```python
+ # Use /think and /no_think tags to control behavior
+ messages = [
+     {"role": "user", "content": "Explain quantum computing /no_think"},  # Quick response
+     {"role": "assistant", "content": "Quantum computing uses quantum bits..."},
+     {"role": "user", "content": "How does superposition work mathematically? /think"}  # Detailed reasoning
+ ]
+ ```
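+
+ The tags are soft switches read from the prompt text itself, so the tagged conversation above can be run through the same template and `generate` call as the basic example; a minimal sketch, assuming the default `enable_thinking=True` (which the soft switches require):
+
+ ```python
+ # Same tokenizer/model as above; /think and /no_think steer the latest turn
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+ generated_ids = model.generate(**model_inputs, max_new_tokens=32768, temperature=0.6, top_p=0.95, top_k=20)
+ ```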
+
+ ### vLLM Deployment with Reasoning Support
+ ```python
+ # Start the server with the reasoning parser:
+ # vllm serve cognitivecomputations/Qwen3-58B-Embiggened --enable-reasoning --reasoning-parser deepseek_r1
+
+ from openai import OpenAI
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
+
+ # Use with thinking mode
+ response = client.chat.completions.create(
+     model="cognitivecomputations/Qwen3-58B-Embiggened",
+     messages=[{"role": "user", "content": "Solve: What is 15% of 250?"}],
+     extra_body={"enable_thinking": True}
+ )
+ ```
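+
+ When the reasoning parser is active, the thinking trace is expected to come back separately from the final answer; a minimal sketch of reading both (the `reasoning_content` attribute follows vLLM's reasoning-output convention and may differ across versions):
+
+ ```python
+ # Reasoning and answer are returned as separate fields when parsing is enabled
+ message = response.choices[0].message
+ print("Reasoning:", getattr(message, "reasoning_content", None))
+ print("Answer:", message.content)
+ ```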
+
+ ### Advanced Usage with Quantization
+ ```python
+ import torch
+ from transformers import BitsAndBytesConfig
+
+ # 4-bit quantization for reduced memory usage
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.bfloat16,
+     bnb_4bit_use_double_quant=True,
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "cognitivecomputations/Qwen3-58B-Embiggened",
+     quantization_config=bnb_config,
+     device_map="auto"
+ )
+ ```
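+
+ The quantized model drops in for the bf16 one; a quick smoke test, reusing the tokenizer loaded in the basic example and the `get_memory_footprint` helper from transformers:
+
+ ```python
+ # Report memory use after 4-bit loading, then run a short generation
+ print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
+ inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
+ out = model.generate(**inputs, max_new_tokens=20)
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
+ ```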
+
+ ### Example Outputs with Thinking
+
+ ```
+ Prompt: "How many r's are in strawberry?"
+ Thinking: Let me count the r's in "strawberry". S-t-r-a-w-b-e-r-r-y.
+ Going through each letter: s(no), t(no), r(yes, 1), a(no), w(no),
+ b(no), e(no), r(yes, 2), r(yes, 3), y(no).
+ Final answer: There are 3 r's in the word "strawberry".
+
+ Prompt: "What is the capital of France, and what is it famous for?"
+ Final answer (no thinking): Paris is the capital of France. It's famous for
+ the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and its rich
+ cultural heritage, fashion, and cuisine.
  ```
 
  ## Hardware Requirements