anemll committed
Commit 69f2e2c · verified · 1 Parent(s): 41920b0

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,152 @@
1
+ ---
2
+ license: mit
3
+ tags:
4
+ - coreml
5
+ - ANE
6
+ - LLaMA
7
+ - Qwen
8
+ - DeepSeek
9
+ - Apple
10
+ - Apple Neural Engine
11
+ - DeepHermes
12
+ ---
13
+ # ANEMLL
14
+
15
+ **ANEMLL** (pronounced like "animal") is an open-source project focused on accelerating the porting of Large Language Models (LLMs) to tensor processors, starting with the Apple Neural Engine (ANE).
16
+
17
+ The goal is to provide a fully open-source pipeline from model conversion to inference for common LLM architectures running on ANE.
18
+
19
+ This enables seamless integration and on-device inference for low-power applications on edge devices, ensuring maximum privacy and security.
20
+
21
+ This is critical for autonomous applications, where models run directly on the device without requiring an internet connection.
22
+
23
+ For more information, visit the [ANEMLL GitHub repository](https://github.com/anemll/anemll).
24
+
25
+
26
+ ---
27
+
28
+ ## License
29
+
30
+ ANEMLL is licensed under the [MIT License](https://opensource.org/license/mit).
31
+ The original model may require a separate license depending on the architecture:
32
+ - LLaMA models: Based on Meta's LLaMA and may require Meta's license
33
+ - Qwen models: Based on Alibaba's Qwen and may require Alibaba's license
34
+ - Other models: Check respective original model licenses
35
+
36
+ This model is converted for CoreML using ANEMLL's open-source conversion pipeline. It supports multiple LLM architectures including LLaMA, Qwen, and DeepSeek variants.
37
+
38
+ ---
39
+
40
+ ## Requirements
41
+
42
+ - **macOS Sequoia** with an Apple Neural Engine and at least 8 GB of RAM
43
+ - **CoreML Tools** and **HuggingFace Transformers** libraries
44
+ - **Python 3.9**
45
+
46
+ `chat.py` provides a sample inference script.
47
+ `chat_full.py` provides a sample inference script with history and conversation management.
48
+
49
+ **Installation**
50
+
51
+ 1. Download the model from Hugging Face (a Python alternative is sketched after these steps):
52
+ ```bash
53
+ # Install required tools
54
+ pip install huggingface_hub
55
+
56
+ # Install Git LFS (Large File Storage)
57
+ # macOS with Homebrew:
58
+ brew install git-lfs
59
+ # Or Ubuntu/Debian:
60
+ # sudo apt-get install git-lfs
61
+
62
+ # Initialize Git LFS
63
+ git lfs install
64
+
65
+ # Clone the repository with model files
66
+ git clone https://huggingface.co/anemll/anemll-Qwen3-1.7B-LUT6-ctx100
67
+ ```
68
+
69
+ 2. Extract model files:
70
+ ```bash
71
+ # Navigate to cloned directory
72
+ cd anemll-Qwen3-1.7B-LUT6-ctx100
73
+
74
+ # Pull LFS files (model weights)
75
+ git lfs pull
76
+
77
+ # Extract CoreML model files
78
+ find . -type f -name "*.zip" -exec unzip {} \;
79
+ ```
80
+
81
+ 3. Install dependencies:
82
+ ```bash
83
+ pip install coremltools transformers
84
+ ```
85
+
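+ As an alternative to the `git clone` flow in step 1, the files can also be fetched programmatically. This is a minimal sketch using `huggingface_hub`'s `snapshot_download`; the local directory name is just an example:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download the full repository (CoreML chunks, meta.yaml, tokenizer files).
+ snapshot_download(
+     repo_id="anemll/anemll-Qwen3-1.7B-LUT6-ctx100",
+     local_dir="./anemll-Qwen3-1.7B-LUT6-ctx100",
+ )
+ # The zipped .mlmodelc archives still need to be unzipped as in step 2.
+ ```
+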
86
+ **Coremltools:**
87
+
88
+ See coremltools installation guide at https://coremltools.readme.io/v4.0/docs/installation
89
+
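+ For reference, the inference scripts open each converted part with coremltools and schedule it on the Neural Engine. A minimal sketch of what `load_model()` in `chat.py` does for a compiled `.mlmodelc` part (the file name is illustrative; actual names follow the `model_prefix` in `meta.yaml`):
+
+ ```python
+ import coremltools as ct
+
+ # Compiled .mlmodelc parts are opened with CompiledMLModel and run on CPU+ANE.
+ model = ct.models.CompiledMLModel(
+     "qwen_lm_head_lut6.mlmodelc",  # example part name
+     ct.ComputeUnit.CPU_AND_NE,
+ )
+ ```
+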
90
+ **How to Run**
91
+
92
+ 1. Basic chat interface:
93
+ ```bash
94
+ python chat.py --meta ./meta.yaml
95
+ ```
96
+
97
+ 2. Full conversation mode with history:
98
+ ```bash
99
+ python chat_full.py --meta ./meta.yaml
100
+ ```
101
+
102
+ > Note: The first time the model loads, macOS takes some time to compile and place it on the Neural Engine.
103
+ > Subsequent loads will be instantaneous.
104
+ > Use Ctrl-D to exit, Ctrl-C to interrupt inference.
105
+
106
+ **More Info**
107
+ Please check the following links for the latest updates:
108
+
109
+ * [GitHub](https://github.com/anemll)
110
+ * [Hugging Face Models](https://huggingface.co/anemll)
111
+ * [Twitter/X](https://x.com/anemll)
112
+ * [Website](https://anemll.com)
113
+
114
+
115
116
+
117
+ # anemll-Qwen3-1.7B-LUT6-ctx100
118
+
119
+ This is a CoreML model converted using ANEMLL for Apple Neural Engine inference.
120
+
121
+ ## Available Distributions
122
+
123
+ ### Standard Distribution
124
+ - Contains zipped MLMODELC files
125
+ - Suitable for macOS and development
126
+
127
+ ### iOS Distribution
128
+ - Contains unzipped MLMODELC files
129
+ - Ready for iOS deployment
130
+ - Includes offline tokenizer support (see the tokenizer-loading sketch below)
131
+
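+ The same tokenizer files are what `chat.py` loads on macOS via Hugging Face Transformers. A minimal sketch (the directory path is an example):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Point this at the cloned model directory; tokenizer.json and
+ # tokenizer_config.json must be present there.
+ tokenizer = AutoTokenizer.from_pretrained(
+     "./anemll-Qwen3-1.7B-LUT6-ctx100",
+     use_fast=False,
+     trust_remote_code=True,
+ )
+ ```
+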
132
+ ## Model Information
133
+ - Context Length: 1024
134
+ - Batch Size: 64
135
+ - Number of Chunks: 1
136
+
137
+ ## Quick Start
138
+
139
+ ### Test in iOS/macOS App
140
+ Try our sample Chat-Bot app on TestFlight:
141
+ 1. Install TestFlight from App Store
142
+ 2. Join beta test: [TestFlight Link](https://testflight.apple.com/join/jrQq1D1C)
143
+ 3. App includes a small demo model pre-installed
144
+ 4. You can add custom models via HuggingFace URLs
145
+
146
+ > [!NOTE]
147
+ > - The TestFlight app works on both iOS and macOS
148
+ > - Demonstrates proper model integration and provides a reference implementation
149
+ > - iOS requires unzipped MLMODELC files and config.json for offline tokenizer
150
+ > - macOS supports both zipped and unzipped model formats
151
+
chat.py ADDED
@@ -0,0 +1,989 @@
1
+ #!/usr/bin/env python3
2
+ # chat.py
3
+ # chat.py
4
+ # Copyright (c) 2025 Anemll
5
+ # Licensed under the MIT License
6
+
7
+ import argparse
8
+ import os
9
+ import re
10
+ import glob
11
+ from pathlib import Path
12
+ import coremltools as ct
13
+ from transformers import LlamaTokenizer, AutoTokenizer
14
+ import torch
15
+ import torch.nn.functional as F
16
+ import numpy as np
17
+ import queue
18
+ import threading
19
+ import time
20
+ import yaml
21
+ import sys
22
+
23
+ # ANSI color codes
24
+ LIGHT_BLUE = "\033[94m"
25
+ DARK_BLUE = "\033[34m"
26
+ LIGHT_GREEN = "\033[92m"
27
+ RESET_COLOR = "\033[0m"
28
+
29
+ # Add at top with other constants
30
+ WARMUP_TOKEN_LIMIT = 10 # Maximum tokens to generate during warmup
31
+
32
+ class TokenPrinter:
33
+ """Handles background printing of generated tokens."""
34
+ def __init__(self, tokenizer):
35
+ self.tokenizer = tokenizer
36
+ self.token_queue = queue.Queue()
37
+ self.stop_event = threading.Event()
38
+ self.thread = None
39
+ self.buffer = ""
40
+ self.lock = threading.Lock()
41
+ self.thinking = True # Track if we're still in thinking mode
42
+ self.decoding_buffer = [] # Buffer for token IDs
43
+ # Add token counting and timing
44
+ self.start_time = time.time()
45
+ self.token_count = 0
46
+ self.start()
47
+
48
+ def start(self):
49
+ """Start the printer thread."""
50
+ if self.thread is None:
51
+ self.thread = threading.Thread(target=self._print_worker)
52
+ self.thread.daemon = True
53
+ self.thread.start()
54
+
55
+ def add_token(self, token_id):
56
+ """Add a token to the print queue."""
57
+ if not self.stop_event.is_set():
58
+ self.token_queue.put(token_id)
59
+ self.token_count += 1
60
+
61
+ def drain_buffer(self):
62
+ """Decode token IDs from decoding_buffer in the main thread."""
63
+ if not self.decoding_buffer:
64
+ return
65
+
66
+ # Decode all tokens at once in the main thread
67
+ token_str = self.tokenizer.decode(self.decoding_buffer)
68
+ self.decoding_buffer.clear()
69
+
70
+ # Store the text in buffer for later saving to file
71
+ with self.lock:
72
+ self.buffer += token_str
73
+
74
+ # Color-handling logic
75
+ if self.thinking and "</think>" in token_str:
76
+ self.thinking = False
77
+ parts = token_str.split("</think>")
78
+ if len(parts) > 0:
79
+ print(parts[0] + "</think>", end='', flush=True)
80
+ if len(parts) > 1:
81
+ print(LIGHT_BLUE + parts[1], end='', flush=True)
82
+ else:
83
+ if not self.thinking:
84
+ print(LIGHT_BLUE + token_str, end='', flush=True)
85
+ else:
86
+ print(token_str, end='', flush=True)
87
+
88
+ def _print_worker(self):
89
+ """Worker thread that takes token_ids from the queue."""
90
+ while not self.stop_event.is_set():
91
+ try:
92
+ token_id = self.token_queue.get(timeout=0.01)
93
+ with self.lock:
94
+ self.decoding_buffer.append(token_id)
95
+ self.token_queue.task_done()
96
+ except queue.Empty:
97
+ continue
98
+ except Exception as e:
99
+ print(f"\nError: Token printer error: {str(e)}")
100
+ break
101
+
102
+ def stop(self):
103
+ """Stop the printer thread."""
104
+ if self.thread and self.thread.is_alive():
105
+ # Ensure any remaining tokens are processed
106
+ self.drain_buffer()
107
+ self.stop_event.set()
108
+ try:
109
+ self.thread.join(timeout=1.0)
110
+ except Exception:
111
+ pass
112
+ # Calculate and print tokens/s with shorter format in blue
113
+ elapsed = time.time() - self.start_time
114
+ if elapsed > 0 and self.token_count > 0:
115
+ tokens_per_sec = self.token_count / elapsed
116
+ print(f"\n{DARK_BLUE}{tokens_per_sec:.1f} t/s{RESET_COLOR}")
117
+ else:
118
+ print(RESET_COLOR) # Reset color at the end
119
+ return self.buffer
120
+
121
+ def parse_model_path(path):
122
+ """Parse model path and return full path with .mlmodelc or .mlpackage extension."""
123
+ path = Path(path)
124
+
125
+ # If path exists exactly as specified, return it
126
+ if path.exists():
127
+ return str(path)
128
+
129
+ # Try with both extensions
130
+ candidates = [
131
+ path, # Original path
132
+ path.with_suffix('.mlmodelc'), # With .mlmodelc
133
+ path.with_suffix('.mlpackage'), # With .mlpackage
134
+ Path(str(path) + '.mlmodelc'), # Handle case where extension is included
135
+ Path(str(path) + '.mlpackage')
136
+ ]
137
+
138
+ # Try all possible paths
139
+ for candidate in candidates:
140
+ if candidate.exists():
141
+ print(f"Found model at: {candidate}")
142
+ return str(candidate)
143
+
144
+ # If embeddings with LUT suffix not found, try without LUT suffix
145
+ if "_lut" in str(path) and "embeddings" in str(path):
146
+ print(f"Failed to find {path}, trying without LUT suffix...")
147
+ # Remove LUT suffix
148
+ path_no_lut = str(path).split("_lut")[0]
149
+ path_no_lut = Path(path_no_lut)
150
+
151
+ # Try candidates without LUT suffix
152
+ candidates_no_lut = [
153
+ path_no_lut,
154
+ path_no_lut.with_suffix('.mlmodelc'),
155
+ path_no_lut.with_suffix('.mlpackage'),
156
+ Path(str(path_no_lut) + '.mlmodelc'),
157
+ Path(str(path_no_lut) + '.mlpackage')
158
+ ]
159
+
160
+ for candidate in candidates_no_lut:
161
+ if candidate.exists():
162
+ print(f"Found model at: {candidate}")
163
+ return str(candidate)
164
+
165
+ # Add no-LUT candidates to the list for error reporting
166
+ candidates.extend(candidates_no_lut)
167
+
168
+ # If we get here, no valid path was found
169
+ print("\nError: Model not found. Tried the following paths:")
170
+ for candidate in candidates:
171
+ print(f" {candidate}")
172
+ raise FileNotFoundError(f"Model not found: {path}")
173
+
174
+ def parse_ffn_filename(path):
175
+ """Parse FFN model filename to extract chunk information."""
176
+ path = Path(path)
177
+ pattern = r'FFN_PF.*_chunk_(\d+)of(\d+)'
178
+ match = re.search(pattern, path.name)
179
+
180
+ if match:
181
+ current_chunk = int(match.group(1))
182
+ total_chunks = int(match.group(2))
183
+ return current_chunk, total_chunks
184
+ return None, None
185
+
186
+ def find_all_chunks(base_path):
187
+ """Find all chunk files matching the base FFN path pattern."""
188
+ path = Path(base_path)
189
+ pattern = re.sub(r'_chunk_\d+of\d+', '_chunk_*', str(path))
190
+ return sorted(glob.glob(pattern))
191
+
192
+ def load_model(path, function_name=None):
193
+ """Load a CoreML model, handling both .mlmodelc and .mlpackage formats."""
194
+ path = Path(path)
195
+ compute_unit = ct.ComputeUnit.CPU_AND_NE
196
+
197
+ try:
198
+ if path.suffix == '.mlmodelc':
199
+ # For compiled models (.mlmodelc), use CompiledMLModel
200
+ if function_name:
201
+ return ct.models.CompiledMLModel(str(path), compute_unit, function_name=function_name)
202
+ else:
203
+ return ct.models.CompiledMLModel(str(path), compute_unit)
204
+ else:
205
+ # For packages (.mlpackage)
206
+ if function_name:
207
+ return ct.models.MLModel(str(path), function_name=function_name)
208
+ else:
209
+ return ct.models.MLModel(str(path))
210
+
211
+ except RuntimeError as e:
212
+ if "valid manifest does not exist" in str(e):
213
+ print(f"\nError: Could not load compiled model at {path}")
214
+ print("This might be because:")
215
+ print("1. The model is not properly compiled")
216
+ print("2. The model was compiled for a different OS version")
217
+ print("3. The model needs to be recompiled")
218
+ print("\nTry using the .mlpackage version instead, or recompile the model.")
219
+ raise
220
+
221
+ def load_metadata(model,args):
222
+ # Extract metadata and config parameters
223
+ metadata = {}
224
+ if hasattr(model, 'user_defined_metadata'):
225
+ meta = model.user_defined_metadata
226
+
227
+ # Extract key parameters with defaults
228
+ metadata['context_length'] = int(meta.get('com.anemll.context_length', 512))
229
+ metadata['state_length'] = int(meta.get('com.anemll.state_length', metadata['context_length'])) # Added state_length
230
+ metadata['batch_size'] = int(meta.get('com.anemll.batch_size', 64))
231
+ metadata['lut_bits'] = int(meta.get('com.anemll.lut_bits', 0))
232
+ metadata['num_chunks'] = int(meta.get('com.anemll.num_chunks', 1))
233
+
234
+ print("\nExtracted Parameters:")
235
+ print(f" Context Length: {metadata['context_length']}")
236
+ print(f" State Length: {metadata['state_length']}")
237
+ print(f" Prefill Batch Size: {metadata['batch_size']}")
238
+ print(f" LUT Bits: {metadata['lut_bits']}")
239
+ print(f" Number of Chunks: {metadata['num_chunks']}")
240
+
241
+ # Print model info
242
+ print("\nModel Info:")
243
+ if 'com.anemll.info' in meta:
244
+ print(f" {meta['com.anemll.info']}")
245
+ if 'com.github.apple.coremltools.version' in meta:
246
+ print(f" CoreML Tools: {meta['com.github.apple.coremltools.version']}")
247
+
248
+ # Print model input/output shapes
249
+ print("\nModel Shapes:")
250
+ if hasattr(model, 'input_description'):
251
+ print(" Inputs:")
252
+ try:
253
+ if hasattr(model.input_description, 'items'):
254
+ for name, desc in model.input_description.items():
255
+ print(f" {name}: {desc}")
256
+ else:
257
+ print(f" {model.input_description}")
258
+ except:
259
+ print(f" Input description: {type(model.input_description)}")
260
+ if hasattr(model, 'output_description'):
261
+ print(" Outputs:")
262
+ try:
263
+ if hasattr(model.output_description, 'items'):
264
+ for name, desc in model.output_description.items():
265
+ print(f" {name}: {desc}")
266
+ else:
267
+ print(f" {model.output_description}")
268
+ except:
269
+ print(f" Output description: {type(model.output_description)}")
270
+ else:
271
+ print("\nWarning: No metadata found in model")
272
+
273
+ # Check if model directory name contains context length pattern (ctxXXX)
274
+ ctx_len = 512
275
+ if args.context_length is None:
276
+ import re
277
+ ctx_match = re.search(r'ctx(\d+)', str(args.d))
278
+ if ctx_match:
279
+ ctx_len0 = int(ctx_match.group(1))
280
+ if 512 <= ctx_len0 <= 8096:
281
+ ctx_len = ctx_len0
282
+ print(f"\nDetected context length {ctx_len} from directory name")
283
+ else:
284
+ print(f"\nWarning: No context length found in directory name {args.d}; using default {ctx_len}")
285
+ else:
286
+ ctx_len = args.context_length
287
+
288
+ # Use defaults or values from args
289
+ metadata['context_length'] = ctx_len
290
+ metadata['state_length'] = ctx_len
291
+ # Get batch size from args or use default
292
+ metadata['batch_size'] = getattr(args, 'batch_size', 64)
293
+ metadata['lut_bits'] = 4
294
+ metadata['num_chunks'] = getattr(args, 'num_chunks', 4)
295
+ print("\nUsing parameters:")
296
+ print(f" Context Length: {metadata['context_length']}")
297
+ print(f" State Length: {metadata['state_length']}")
298
+ print(f" Prefill Batch Size: {metadata['batch_size']}")
299
+ print(f" LUT Bits: {metadata['lut_bits']}")
300
+ print(f" Number of Chunks: {metadata['num_chunks']}")
301
+
302
+ # Override with values from args if they exist
303
+ if hasattr(args, 'batch_size') and args.batch_size is not None:
304
+ metadata['batch_size'] = args.batch_size
305
+ print(f"\nOverriding batch size from args: {args.batch_size}")
306
+ if hasattr(args, 'num_chunks') and args.num_chunks is not None:
307
+ metadata['num_chunks'] = args.num_chunks
308
+ print(f"\nOverriding num chunks from args: {args.num_chunks}")
309
+
310
+ return metadata
311
+
312
+ def load_models(args,metadata):
313
+ """Load all required models and extract metadata."""
314
+ print("\nLoading models...")
315
+
316
+ try:
317
+ # Load embeddings model
318
+ print("\nLoading embeddings model...")
319
+ embed_path = parse_model_path(args.embed)
320
+ print(f"Loading from: {embed_path}")
321
+ embed_model = load_model(embed_path)
322
+ print("Embeddings model loaded successfully")
323
+ metadata = load_metadata(embed_model,args)
324
+
325
+
326
+
327
+ # Load LM head model
328
+ print("\nLoading LM head model...")
329
+ lmhead_path = parse_model_path(args.lmhead)
330
+ print(f"Loading from: {lmhead_path}")
331
+ lmhead_model = load_model(lmhead_path)
332
+ print("LM head model loaded successfully")
333
+
334
+ # Parse FFN path and find chunks if needed
335
+ print("\nLoading FFN+PREFILL model(s)...")
336
+ ffn_path = parse_model_path(args.ffn)
337
+ chunk_no, total_chunks = parse_ffn_filename(ffn_path)
338
+
339
+ ffn_models = []
340
+ if chunk_no and total_chunks:
341
+ print(f"\nDetected chunked FFN+PREFILL model ({total_chunks} chunks)")
342
+ # Find and load all chunks
343
+ chunk_paths = find_all_chunks(ffn_path)
344
+ if len(chunk_paths) != total_chunks:
345
+ raise ValueError(f"Found {len(chunk_paths)} chunks but filename indicates {total_chunks} chunks")
346
+
347
+ for chunk_path in chunk_paths:
348
+ print(f"\nLoading FFN+PREFILL chunk: {Path(chunk_path).name}")
349
+ try:
350
+ # For chunked models, we need both infer and prefill functions
351
+ ffn_models.append({
352
+ 'infer': load_model(chunk_path, function_name='infer'),
353
+ 'prefill': load_model(chunk_path, function_name='prefill')
354
+ })
355
+ print("Chunk loaded successfully")
356
+ except Exception as e:
357
+ print(f"Error loading chunk {chunk_path}: {str(e)}")
358
+ raise
359
+ metadata = load_metadata(ffn_models[0],args)
360
+
361
+ else:
362
+ print("\nLoading single FFN model...")
363
+ ffn_models.append(load_model(ffn_path))
364
+ print("FFN model loaded successfully")
365
+
366
+ return embed_model, ffn_models, lmhead_model, metadata
367
+
368
+ except Exception as e:
369
+ print(f"\nError loading models: {str(e)}")
370
+ print("\nPlease ensure all model files exist and are accessible.")
371
+ print("Expected files:")
372
+ print(f" Embeddings: {args.embed}")
373
+ print(f" LM Head: {args.lmhead}")
374
+ print(f" FFN: {args.ffn}")
375
+ raise
376
+
377
+ # At the top of the file, make this a default path
378
+
379
+ def initialize_tokenizer(model_path=None):
380
+ """Initialize and configure the tokenizer."""
381
+ try:
382
+
383
+
384
+ tokenizer = AutoTokenizer.from_pretrained(
385
+ str(model_path),
386
+ use_fast=False,
387
+ trust_remote_code=True
388
+ )
389
+
390
+ print("\nTokenizer Configuration:")
391
+ print(f"Tokenizer type: {type(tokenizer)}")
392
+ print(f"Tokenizer name: {tokenizer.__class__.__name__}")
393
+ print(f"Vocabulary size: {len(tokenizer)}")
394
+ print(f"Model max length: {tokenizer.model_max_length}")
395
+
396
+ if tokenizer.pad_token is None:
397
+ tokenizer.pad_token = tokenizer.eos_token
398
+ tokenizer.pad_token_id = tokenizer.eos_token_id
399
+ print("Set PAD token to EOS token")
400
+
401
+ tokenizer.padding_side = "left"
402
+
403
+ print(f"\nSpecial Tokens:")
404
+ print(f"PAD token: '{tokenizer.pad_token}' (ID: {tokenizer.pad_token_id})")
405
+ print(f"EOS token: '{tokenizer.eos_token}' (ID: {tokenizer.eos_token_id})")
406
+ print(f"BOS token: '{tokenizer.bos_token}' (ID: {tokenizer.bos_token_id})")
407
+ print(f"UNK token: '{tokenizer.unk_token}' (ID: {tokenizer.unk_token_id})")
408
+
409
+ return tokenizer
410
+
411
+ except Exception as e:
412
+ print(f"\nError: Failed to load tokenizer from {model_path}")
413
+ print(f"Error details: {str(e)}")
414
+ print(f"Error type: {type(e)}")
415
+ print("\nThis appears to be a tokenizer loading issue.")
416
+
417
+ # Check if it's the specific Qwen tokenizer file issue
418
+ if "expected str, bytes or os.PathLike object, not NoneType" in str(e):
419
+ print("\nThis error suggests the tokenizer files are missing or incomplete.")
420
+ print("For Qwen models, you need the original model directory with tokenizer files.")
421
+ print("Try using: --tokenizer ~/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/YOUR_SNAPSHOT_ID")
422
+ else:
423
+ print("Please provide the path to a compatible model directory with tokenizer files.")
424
+ import traceback
425
+ traceback.print_exc()
426
+ raise
427
+
428
+
429
+
430
+ def make_causal_mask(length, start):
431
+ """Create causal attention mask."""
432
+ mask = np.full((1, 1, length, length), -np.inf, dtype=np.float16)
433
+ row_indices = np.arange(length).reshape(length, 1)
434
+ col_indices = np.arange(length).reshape(1, length)
435
+ mask[:, :, col_indices <= (row_indices + start)] = 0
436
+ return mask
437
+
438
+ def initialize_causal_mask(context_length):
439
+ """Initialize causal mask for transformer attention."""
440
+ causal_mask = make_causal_mask(context_length, 0)
441
+ causal_mask = torch.tensor(causal_mask, dtype=torch.float16)
442
+ print(f"\nInitialized causal mask for context length {context_length}")
443
+ return causal_mask
444
+
445
+ def run_prefill(embed_model, ffn_models, input_ids, context_pos, context_length, batch_size=64, state=None, causal_mask=None):
446
+ """Run prefill on the input sequence."""
447
+ # Use provided causal mask or create one if not provided
448
+ if causal_mask is None:
449
+ causal_mask = make_causal_mask(context_length, 0)
450
+ causal_mask = torch.tensor(causal_mask, dtype=torch.float16)
451
+
452
+ # Process in batches
453
+ batch_pos = 0
454
+ while batch_pos < context_pos:
455
+ batch_end = min(batch_pos + batch_size, context_pos)
456
+ current_batch_size = batch_end - batch_pos
457
+
458
+ # Get current batch
459
+ batch_input = input_ids[:, batch_pos:batch_end]
460
+
461
+ # Always pad to full batch size for prefill
462
+ batch_input = F.pad(
463
+ batch_input,
464
+ (0, batch_size - current_batch_size),
465
+ value=0
466
+ )
467
+
468
+ # Generate position IDs for full batch size
469
+ position_ids = torch.arange(batch_pos, batch_pos+batch_size, dtype=torch.int32) # Changed: Always use full batch size
470
+ batch_causal_mask = causal_mask[:, :, batch_pos:batch_pos+batch_size, :] # Changed: Use full batch size
471
+
472
+ # Run embeddings
473
+ hidden_states = torch.from_numpy(
474
+ embed_model.predict({
475
+ 'input_ids': batch_input.numpy()
476
+ })['hidden_states']
477
+ )
478
+
479
+ # Run through FFN chunks with state
480
+ for ffn_model in ffn_models:
481
+ if isinstance(ffn_model, dict):
482
+ inputs = {
483
+ 'hidden_states': hidden_states.numpy(), # [1, 64, hidden_size]
484
+ 'position_ids': position_ids.numpy(), # [64]
485
+ 'causal_mask': batch_causal_mask.numpy(), # [1, 1, 64, context_length]
486
+ 'current_pos': np.array([batch_pos], dtype=np.int32) # [1]
487
+ }
488
+ output = ffn_model['prefill'].predict(inputs, state)
489
+ hidden_states = torch.from_numpy(output['output_hidden_states'])
490
+
491
+ batch_pos = batch_end
492
+
493
+ return torch.tensor([context_pos], dtype=torch.int32)
494
+
495
+ def generate_next_token(embed_model, ffn_models, lmhead_model, input_ids, pos, context_length, metadata, state=None, causal_mask=None, temperature=0.0):
496
+ """Generate the next token."""
497
+ # Get current token
498
+ current_token = input_ids[:, pos-1:pos] # [1, 1]
499
+
500
+ # Run embeddings
501
+ hidden_states = torch.from_numpy(
502
+ embed_model.predict({'input_ids': current_token.numpy()})['hidden_states']
503
+ ) # [1, 1, hidden_size]
504
+
505
+ # Create masks
506
+ update_mask = torch.zeros((1, 1, context_length, 1), dtype=torch.float16)
507
+ update_mask[0, 0, pos-1, 0] = 1.0
508
+ position_ids = torch.tensor([pos-1], dtype=torch.int32) # [1]
509
+
510
+ # Use provided causal mask or create one if not provided
511
+ if causal_mask is None:
512
+ causal_mask_data = make_causal_mask(context_length, 0)
513
+ single_causal_mask = torch.tensor(causal_mask_data[:, :, pos-1:pos, :], dtype=torch.float16) # [1, 1, 1, context_length]
514
+ else:
515
+ single_causal_mask = causal_mask[:, :, pos-1:pos, :]
516
+
517
+ # Run through FFN chunks with state
518
+ for ffn_model in ffn_models:
519
+ if isinstance(ffn_model, dict):
520
+ inputs = {
521
+ 'hidden_states': hidden_states.numpy(),
522
+ 'update_mask': update_mask.numpy(),
523
+ 'position_ids': position_ids.numpy(),
524
+ 'causal_mask': single_causal_mask.numpy(),
525
+ 'current_pos': position_ids.numpy()
526
+ }
527
+ output = ffn_model['infer'].predict(inputs, state)
528
+ hidden_states = torch.from_numpy(output['output_hidden_states'])
529
+
530
+ # Run LM head
531
+ lm_output = lmhead_model.predict({'hidden_states': hidden_states.numpy()})
532
+ # Debug print
533
+ #print("\nLM Head output keys:", list(lm_output.keys()))
534
+
535
+ # Get number of logits from metadata, using split_lm_head if available
536
+ # First check for split_lm_head (new), then num_logits (legacy), default to 8
537
+ num_logits = metadata.get('split_lm_head', metadata.get('num_logits', 8))
538
+
539
+ # Combine logits1-N if they exist
540
+ if 'logits1' in lm_output:
541
+ # Concatenate all logits parts
542
+ logits_parts = []
543
+ for i in range(1, num_logits + 1):
544
+ key = f'logits{i}'
545
+ if key in lm_output:
546
+ logits_parts.append(torch.from_numpy(lm_output[key]))
547
+ logits = torch.cat(logits_parts, dim=-1) # Concatenate along vocab dimension
548
+ else:
549
+ # Try output_logits as fallback
550
+ logits = torch.from_numpy(lm_output['output_logits'])
551
+
552
+ # Apply temperature and sample
553
+ if temperature > 0:
554
+ logits = logits / temperature
555
+ probs = F.softmax(logits[0, -1, :], dim=-1)
556
+ next_token = torch.multinomial(probs, num_samples=1).item()
557
+ else:
558
+ next_token = torch.argmax(logits[0, -1, :]).item()
559
+
560
+ return next_token
561
+
562
+ def create_unified_state(ffn_models, context_length):
563
+ """Create unified KV cache state for transformer."""
564
+ if isinstance(ffn_models[0], dict):
565
+ # Use first FFN model's prefill function to create state
566
+ state = ffn_models[0]['prefill'].make_state()
567
+ print(f"\nCreated unified transformer state for {len(ffn_models)} chunks")
568
+ return state
569
+ else:
570
+ state = ffn_models[0].make_state()
571
+ print("\nCreated unified transformer state")
572
+ return state
573
+
574
+ def chat_loop(embed_model, ffn_models, lmhead_model, tokenizer, metadata, state, causal_mask=None, auto_prompt=None, warmup=False, save_file=None):
575
+ """Interactive chat loop."""
576
+ context_length = metadata.get('context_length')
577
+ batch_size = metadata.get('batch_size', 64)
578
+
579
+ if not warmup:
580
+ print(f"\nUsing context length: {context_length}")
581
+ print("\nStarting chat session. Press Ctrl+D to exit.")
582
+ print("Type your message and press Enter to chat.")
583
+
584
+ # Check if tokenizer has chat template and if it works
585
+ has_chat_template = False
586
+ try:
587
+ # Test if chat template works
588
+ test_messages = [{"role": "user", "content": "test"}]
589
+ tokenizer.apply_chat_template(test_messages, return_tensors="pt")
590
+ has_chat_template = True
591
+ if not warmup:
592
+ print("\nUsing chat template for prompts")
593
+ except:
594
+ if not warmup:
595
+ print("\nUsing manual formatting for prompts")
596
+
597
+ conversation = []
598
+
599
+ try:
600
+ while True:
601
+ try:
602
+ if not warmup:
603
+ print(f"\n{LIGHT_GREEN}You:{RESET_COLOR}", end=' ', flush=True)
604
+ if auto_prompt is not None:
605
+ user_input = auto_prompt
606
+ if not warmup:
607
+ print(user_input)
608
+ else:
609
+ user_input = input().strip()
610
+ except EOFError:
611
+ if not warmup:
612
+ print("\nExiting chat...")
613
+ break
614
+
615
+ if not user_input:
616
+ continue
617
+
618
+ # Format prompt based on tokenizer capabilities
619
+ if has_chat_template:
620
+ messages = [{"role": "user", "content": user_input}]
621
+ input_ids = tokenizer.apply_chat_template(
622
+ messages,
623
+ return_tensors="pt",
624
+ add_generation_prompt=True
625
+ ).to(torch.int32)
626
+ else:
627
+ # Manual formatting for Llama models without chat template
628
+ formatted_prompt = f"[INST] {user_input} [/INST]"
629
+ input_ids = tokenizer(
630
+ formatted_prompt,
631
+ return_tensors="pt",
632
+ add_special_tokens=True
633
+ ).input_ids.to(torch.int32)
634
+
635
+ context_pos = input_ids.size(1)
636
+
637
+ if not warmup:
638
+ print(f"\n{LIGHT_BLUE}Assistant:{RESET_COLOR}", end=' ', flush=True)
639
+
640
+ # Initialize token printer
641
+ token_printer = TokenPrinter(tokenizer)
642
+ tokens_generated = 0 # Track number of tokens
643
+
644
+ try:
645
+ # Start prefill timing
646
+ prefill_start = time.time()
647
+
648
+ # Run prefill with state and causal mask
649
+ # Ensure batch_size is not None
650
+ if batch_size is None:
651
+ batch_size = 64
652
+ print(f"Warning: batch_size was None, using default: {batch_size}")
653
+
654
+ _ = run_prefill(
655
+ embed_model,
656
+ ffn_models,
657
+ input_ids,
658
+ context_pos,
659
+ context_length,
660
+ batch_size,
661
+ state,
662
+ causal_mask
663
+ )
664
+
665
+ # Calculate prefill timing
666
+ prefill_time = time.time() - prefill_start
667
+ prefill_tokens = context_pos # Number of tokens in input
668
+ prefill_tokens_per_sec = prefill_tokens / prefill_time if prefill_time > 0 else 0
669
+
670
+ # Generation loop with state
671
+ input_ids = input_ids
672
+ pos = context_pos
673
+ inference_start = time.time()
674
+ inference_tokens = 0
675
+
676
+ while pos < context_length - 1:
677
+ # Generate next token with causal mask
678
+ next_token = generate_next_token(
679
+ embed_model,
680
+ ffn_models,
681
+ lmhead_model,
682
+ input_ids,
683
+ pos,
684
+ context_length,
685
+ metadata,
686
+ state,
687
+ causal_mask
688
+ )
689
+
690
+ # Add token to sequence
691
+ if pos < input_ids.size(1):
692
+ input_ids[0, pos] = next_token
693
+ else:
694
+ input_ids = torch.cat([
695
+ input_ids,
696
+ torch.tensor([[next_token]], dtype=torch.int32)
697
+ ], dim=1)
698
+
699
+ # Add to printer only if not in warmup
700
+ if not warmup:
701
+ token_printer.add_token(next_token)
702
+ token_printer.drain_buffer()
703
+
704
+ pos += 1
705
+ tokens_generated += 1
706
+ inference_tokens += 1
707
+
708
+ # Check limits
709
+ if warmup and tokens_generated >= WARMUP_TOKEN_LIMIT:
710
+ break
711
+
712
+ if next_token == tokenizer.eos_token_id:
713
+ break
714
+
715
+ # Calculate inference timing
716
+ inference_time = time.time() - inference_start
717
+ inference_tokens_per_sec = inference_tokens / inference_time if inference_time > 0 else 0
718
+
719
+ # Get final response and add to conversation
720
+ if not warmup:
721
+ response = token_printer.stop()
722
+ # Print timing stats
723
+ prefill_ms = prefill_time * 1000 # Convert to milliseconds
724
+ print(f"\nPrefill: {prefill_ms:.1f}ms ({prefill_tokens_per_sec:.1f} t/s)")
725
+ print(f"Inference: {inference_tokens_per_sec:.1f} t/s")
726
+ print(f"Total: Generated {tokens_generated} tokens in {prefill_time + inference_time:.2f}s")
727
+ conversation.append({"role": "assistant", "content": response})
728
+
729
+ # Save response to file if requested
730
+ if save_file:
731
+ try:
732
+ # Add small delay to ensure all tokens are processed
733
+ time.sleep(0.5)
734
+
735
+ # Make sure response ends with EOS token if it's supposed to
736
+ if response and not response.endswith("<|eot_id|>") and not response.endswith("</s>"):
737
+ if tokenizer.eos_token:
738
+ eos_text = tokenizer.decode([tokenizer.eos_token_id])
739
+ if not response.endswith(eos_text):
740
+ print(f"\n{DARK_BLUE}Adding missing EOS token for consistency{RESET_COLOR}")
741
+ response += eos_text
742
+
743
+ with open(save_file, 'w') as f:
744
+ f.write(response)
745
+ print(f"\n{DARK_BLUE}Response saved to file: {save_file}{RESET_COLOR}")
746
+ except Exception as e:
747
+ print(f"\n{DARK_BLUE}Error saving to file: {str(e)}{RESET_COLOR}")
748
+ else:
749
+ token_printer.stop() # Clean up without printing stats
750
+
751
+ # Exit after one response in auto_prompt mode
752
+ if auto_prompt is not None:
753
+ break
754
+
755
+ except KeyboardInterrupt:
756
+ print("\nGeneration interrupted")
757
+ token_printer.stop()
758
+ continue
759
+
760
+ except Exception as e:
761
+ print(f"\nError in chat loop: {str(e)}")
762
+ import traceback
763
+ traceback.print_exc()
764
+
765
+ def parse_args():
766
+ parser = argparse.ArgumentParser(description='Chat with CoreML LLaMA, gil resolved (c) 2025 Anemll')
767
+
768
+ # Add meta.yaml option
769
+ parser.add_argument('--meta', type=str, help='Path to meta.yaml to load all parameters')
770
+
771
+ # Model paths
772
+ parser.add_argument('--d', '--dir', type=str, default='.',
773
+ help='Directory containing model files (default: current directory)')
774
+ parser.add_argument('--embed', type=str, required=False,
775
+ help='Path to embeddings model (relative to --dir)')
776
+ parser.add_argument('--ffn', type=str, required=False,
777
+ help='Path to FFN model (can be chunked, relative to --dir)')
778
+ parser.add_argument('--lmhead', type=str, required=False,
779
+ help='Path to LM head model (relative to --dir)')
780
+ parser.add_argument('--tokenizer', type=str, required=False,
781
+ help='Path to tokenizer')
782
+
783
+ # Add new argument for auto-generation
784
+ parser.add_argument('--prompt', type=str,
785
+ help='If specified, run once with this prompt and exit')
786
+
787
+ # Add save option
788
+ parser.add_argument('--save', type=str,
789
+ help='Save assistant\'s response to specified file')
790
+
791
+ # Add no-warmup flag
792
+ parser.add_argument('--nw', action='store_true',
793
+ help='Skip warmup phase')
794
+
795
+ # Model configuration
796
+ parser.add_argument('--context-length', type=int,
797
+ help='Context length for the model (default: 512), if not provided, it will be detected from the model directory name ctxNUMBER')
798
+ parser.add_argument('--batch-size', type=int,
799
+ help='Batch size for prefill (default: 64)')
800
+ parser.add_argument('--num-logits', type=int, default=8,
801
+ help='Number of logits outputs from LM head (default: 8, legacy)')
802
+ parser.add_argument('--split-lm-head', type=int,
803
+ help='Number of logits splits from LM head (default: 8 for llama, 16 for qwen)')
804
+
805
+ args = parser.parse_args()
806
+
807
+ # If meta.yaml is provided, load parameters from it
808
+ if args.meta:
809
+ try:
810
+ with open(args.meta, 'r') as f:
811
+ meta = yaml.safe_load(f)
812
+ params = meta['model_info']['parameters']
813
+
814
+ # Set model directory to meta.yaml directory if not specified
815
+ if not args.d or args.d == '.':
816
+ args.d = str(Path(args.meta).parent)
817
+
818
+ # Build model paths based on parameters
819
+ prefix = params.get('model_prefix', 'llama') # Default to 'llama' if not specified
820
+ lut_ffn = f"_lut{params['lut_ffn']}" if params['lut_ffn'] != 'none' else ''
821
+ lut_lmhead = f"_lut{params['lut_lmhead']}" if params['lut_lmhead'] != 'none' else ''
822
+ lut_embeddings = f"_lut{params['lut_embeddings']}" if params['lut_embeddings'] != 'none' else ''
823
+ num_chunks = int(params['num_chunks'])
824
+
825
+ # Set model paths if not specified
826
+ if not args.lmhead:
827
+ args.lmhead = f'{prefix}_lm_head{lut_lmhead}'
828
+ if not args.embed:
829
+ args.embed = f'{prefix}_embeddings{lut_embeddings}' # Changed from lm_head to embeddings
830
+ if not args.ffn:
831
+ args.ffn = f'{prefix}_FFN_PF{lut_ffn}_chunk_01of{num_chunks:02d}'
832
+ if not args.tokenizer:
833
+ # Check if there's a tokenizer_path parameter in meta.yaml
834
+ if 'tokenizer_path' in params:
835
+ args.tokenizer = params['tokenizer_path']
836
+ else:
837
+ # Default to the model directory, but this might need manual override
838
+ args.tokenizer = args.d
839
+
840
+ # Set other parameters if not overridden by command line
841
+ if args.context_length is None:
842
+ args.context_length = int(params['context_length'])
843
+ if args.batch_size is None:
844
+ args.batch_size = int(params['batch_size'])
845
+ args.num_chunks = num_chunks
846
+ # Add num_logits parameter with default of 8, override command line if present in meta
847
+ if 'num_logits' in params:
848
+ args.num_logits = int(params['num_logits'])
849
+
850
+ # Add split_lm_head parameter with default of 8
851
+ if 'split_lm_head' in params:
852
+ args.split_lm_head = int(params['split_lm_head'])
853
+ else:
854
+ args.split_lm_head = 8 # Default value for backward compatibility
855
+
856
+ print(f"\nLoaded parameters from {args.meta}:")
857
+ print(f" Context Length: {args.context_length}")
858
+ print(f" Batch Size: {args.batch_size}")
859
+ print(f" Num Chunks: {args.num_chunks}")
860
+ print(f" Num Logits: {args.num_logits}")
861
+ print(f" Split LM Head: {args.split_lm_head}")
862
+ print(f" Models Directory: {args.d}")
863
+ print(f" Embeddings: {args.embed}")
864
+ print(f" LM Head: {args.lmhead}")
865
+ print(f" FFN: {args.ffn}")
866
+
867
+ except Exception as e:
868
+ print(f"\nError loading meta.yaml: {str(e)}")
869
+ sys.exit(1)
870
+ else:
871
+ # If no meta.yaml, set default split_lm_head if not provided
872
+ if not hasattr(args, 'split_lm_head') or args.split_lm_head is None:
873
+ args.split_lm_head = args.num_logits # Use num_logits as fallback
874
+
875
+ return args
876
+
877
+ def main():
878
+ args = parse_args()
879
+
880
+ # Convert directory to absolute path
881
+ model_dir = Path(args.d).resolve()
882
+ if not model_dir.exists():
883
+ print(f"\nError: Model directory not found: {model_dir}")
884
+ return 1
885
+
886
+ print(f"\nUsing model directory: {model_dir}")
887
+ print(f"Context length: {args.context_length}")
888
+
889
+ try:
890
+ # Update paths to be relative to model directory
891
+ args.embed = str(model_dir / args.embed)
892
+ args.ffn = str(model_dir / args.ffn)
893
+ args.lmhead = str(model_dir / args.lmhead)
894
+
895
+ # Handle tokenizer path separately since it's not relative to model_dir
896
+ if args.tokenizer is None:
897
+ args.tokenizer = str(model_dir)
898
+
899
+ # Check if tokenizer directory exists and has required files
900
+ tokenizer_path = Path(args.tokenizer)
901
+ if not tokenizer_path.exists():
902
+ print(f"\nError: Tokenizer directory not found: {args.tokenizer}")
903
+ return 1
904
+
905
+ # Check if tokenizer has the required files
906
+ required_files = ['tokenizer.json', 'tokenizer_config.json']
907
+ missing_files = [f for f in required_files if not (tokenizer_path / f).exists()]
908
+
909
+ if missing_files:
910
+ print(f"\nWarning: Tokenizer directory missing required files: {missing_files}")
911
+ print(f"Current tokenizer path: {args.tokenizer}")
912
+ print("\nFor Qwen models, you may need to specify the original model directory:")
913
+ print(" python chat.py --meta /tmp/qwen/meta.yaml --tokenizer ~/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/YOUR_SNAPSHOT_ID")
914
+ print("\nOr add 'tokenizer_path' to your meta.yaml file.")
915
+
916
+ args.tokenizer = str(Path(args.tokenizer).resolve()) # Convert to absolute path
917
+ print(f"Using tokenizer path: {args.tokenizer}")
918
+
919
+ metadata = {}
920
+ # Load models and extract metadata
921
+ embed_model, ffn_models, lmhead_model, metadata = load_models(args,metadata)
922
+
923
+ print(f"\nMetadata before args.context_length: {metadata}")
924
+
925
+ # Override context length from command line if provided
926
+ if args.context_length is not None:
927
+ metadata['context_length'] = args.context_length
928
+ metadata['state_length'] = args.context_length # Also update state_length
929
+ print(f"\nOverriding context length from command line: {args.context_length}")
930
+
931
+ # Add num_logits to metadata (legacy support)
932
+ metadata['num_logits'] = getattr(args, 'num_logits', 8)
933
+
934
+ # Add split_lm_head to metadata (preferred)
935
+ metadata['split_lm_head'] = getattr(args, 'split_lm_head', getattr(args, 'num_logits', 8))
936
+
937
+ print(f"\nMetadata after load_models: {metadata}")
938
+ print(f"Using split_lm_head value: {metadata.get('split_lm_head', 8)}")
939
+
940
+ # Load tokenizer with resolved path
941
+ tokenizer = initialize_tokenizer(args.tokenizer)
942
+ if tokenizer is None:
943
+ raise RuntimeError("Failed to initialize tokenizer")
944
+
945
+ # Create unified state once
946
+ state = create_unified_state(ffn_models, metadata['context_length'])
947
+
948
+ # Initialize causal mask once
949
+ causal_mask = initialize_causal_mask(metadata['context_length'])
950
+
951
+ # Warmup runs to prevent Python GIL issues with CoreML !
952
+ if not args.nw:
953
+ for _ in range(2):
954
+ chat_loop(
955
+ embed_model=embed_model,
956
+ ffn_models=ffn_models,
957
+ lmhead_model=lmhead_model,
958
+ tokenizer=tokenizer,
959
+ metadata=metadata,
960
+ state=state,
961
+ causal_mask=causal_mask, # Pass the causal mask
962
+ warmup=True,
963
+ auto_prompt="who are you?"
964
+ )
965
+
966
+ # Main run
967
+ chat_loop(
968
+ embed_model=embed_model,
969
+ ffn_models=ffn_models,
970
+ lmhead_model=lmhead_model,
971
+ tokenizer=tokenizer,
972
+ metadata=metadata,
973
+ state=state,
974
+ causal_mask=causal_mask, # Pass the causal mask
975
+ warmup=False,
976
+ auto_prompt=args.prompt,
977
+ save_file=args.save
978
+ )
979
+
980
+ except Exception as e:
981
+ print(f"\nError: {str(e)}")
982
+ import traceback
983
+ traceback.print_exc()
984
+ return 1
985
+
986
+ return 0
987
+
988
+ if __name__ == "__main__":
989
+ exit(main())
chat_full.py ADDED
@@ -0,0 +1,1025 @@
1
+ #!/usr/bin/env python3
2
+ # chat_full.py
3
+ # chat_full.py
4
+ # Copyright (c) 2025 Anemll
5
+ # Licensed under the MIT License
6
+
7
+ import argparse
8
+ import os
9
+ import re
10
+ import glob
11
+ from pathlib import Path
12
+ import coremltools as ct
13
+ from transformers import LlamaTokenizer, AutoTokenizer
14
+ import torch
15
+ import torch.nn.functional as F
16
+ import numpy as np
17
+ import queue
18
+ import threading
19
+ import time
20
+ import yaml
21
+ import sys
22
+
23
+ # ANSI color codes
24
+ LIGHT_BLUE = "\033[94m"
25
+ DARK_BLUE = "\033[34m"
26
+ LIGHT_GREEN = "\033[92m"
27
+ RESET_COLOR = "\033[0m"
28
+
29
+ # Add at the top with other constants
30
+ WARMUP_TOKEN_LIMIT = 10 # Maximum tokens to generate during warmup
31
+ THINKING_MODE = False
32
+ THINKING_PROMPT = """You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem."""
33
+ DEBUG_LEVEL = 0 # Default debug level
34
+
35
+ class TokenPrinter:
36
+ """Handles background printing of generated tokens."""
37
+ def __init__(self, tokenizer):
38
+ self.tokenizer = tokenizer
39
+ self.token_queue = queue.Queue()
40
+ self.stop_event = threading.Event()
41
+ self.thread = None
42
+ self.buffer = ""
43
+ self.lock = threading.Lock()
44
+ self.thinking = True # Track if we're still in thinking mode
45
+ self.decoding_buffer = [] # Buffer for token IDs
46
+ # Timing and stats tracking
47
+ self.start_time = time.time()
48
+ self.token_count = 0
49
+ self.prefill_time = 0
50
+ self.inference_time = 0
51
+ self.context_pos = 0
52
+ self.start()
53
+
54
+ def start(self):
55
+ """Start the printer thread."""
56
+ if self.thread is None:
57
+ self.thread = threading.Thread(target=self._print_worker)
58
+ self.thread.daemon = True
59
+ self.thread.start()
60
+
61
+ def add_token(self, token_id):
62
+ """Add a token to the print queue."""
63
+ if not self.stop_event.is_set():
64
+ self.token_queue.put(token_id)
65
+ self.token_count += 1
66
+
67
+ def drain_buffer(self):
68
+ """Decode token IDs from decoding_buffer in the main thread."""
69
+ if not self.decoding_buffer:
70
+ return
71
+
72
+ # Decode all tokens at once in the main thread
73
+ token_str = self.tokenizer.decode(self.decoding_buffer)
74
+ self.decoding_buffer.clear()
75
+
76
+ # Color-handling logic
77
+ if self.thinking and "</think>" in token_str:
78
+ self.thinking = False
79
+ parts = token_str.split("</think>")
80
+ if len(parts) > 0:
81
+ print(parts[0] + "</think>", end='', flush=True)
82
+ if len(parts) > 1:
83
+ print(LIGHT_BLUE + parts[1], end='', flush=True)
84
+ else:
85
+ if not self.thinking:
86
+ print(LIGHT_BLUE + token_str, end='', flush=True)
87
+ else:
88
+ print(token_str, end='', flush=True)
89
+
90
+ def _print_worker(self):
91
+ """Worker thread that takes token_ids from the queue."""
92
+ while not self.stop_event.is_set():
93
+ try:
94
+ token_id = self.token_queue.get(timeout=0.01)
95
+ with self.lock:
96
+ self.decoding_buffer.append(token_id)
97
+ self.token_queue.task_done()
98
+ except queue.Empty:
99
+ continue
100
+ except Exception as e:
101
+ print(f"\nError: Token printer error: {str(e)}")
102
+ break
103
+
104
+ def stop(self):
105
+ """Stop the printer thread."""
106
+ if self.thread and self.thread.is_alive():
107
+ self.stop_event.set()
108
+ try:
109
+ self.thread.join(timeout=1.0)
110
+ except Exception:
111
+ pass
112
+ print(RESET_COLOR) # Reset color at the end
113
+ return self.buffer
114
+
115
+ def set_timing(self, prefill_time, inference_time, context_pos):
116
+ """Set timing information."""
117
+ self.prefill_time = prefill_time
118
+ self.inference_time = inference_time
119
+ self.context_pos = context_pos
120
+
121
+ def parse_model_path(path):
122
+ """Parse model path and return full path with .mlmodelc or .mlpackage extension."""
123
+ path = Path(path)
124
+
125
+ # If path exists exactly as specified, return it
126
+ if path.exists():
127
+ return str(path)
128
+
129
+ # Try with both extensions
130
+ candidates = [
131
+ path, # Original path
132
+ path.with_suffix('.mlmodelc'), # With .mlmodelc
133
+ path.with_suffix('.mlpackage'), # With .mlpackage
134
+ Path(str(path) + '.mlmodelc'), # Handle case where extension is included
135
+ Path(str(path) + '.mlpackage')
136
+ ]
137
+
138
+ # Try all possible paths
139
+ for candidate in candidates:
140
+ if candidate.exists():
141
+ print(f"Found model at: {candidate}")
142
+ return str(candidate)
143
+
144
+ # If embeddings with LUT suffix not found, try without LUT suffix
145
+ if "_lut" in str(path) and "embeddings" in str(path):
146
+ print(f"Failed to find {path}, trying without LUT suffix...")
147
+ # Remove LUT suffix
148
+ path_no_lut = str(path).split("_lut")[0]
149
+ path_no_lut = Path(path_no_lut)
150
+
151
+ # Try candidates without LUT suffix
152
+ candidates_no_lut = [
153
+ path_no_lut,
154
+ path_no_lut.with_suffix('.mlmodelc'),
155
+ path_no_lut.with_suffix('.mlpackage'),
156
+ Path(str(path_no_lut) + '.mlmodelc'),
157
+ Path(str(path_no_lut) + '.mlpackage')
158
+ ]
159
+
160
+ for candidate in candidates_no_lut:
161
+ if candidate.exists():
162
+ print(f"Found model at: {candidate}")
163
+ return str(candidate)
164
+
165
+ # Add no-LUT candidates to the list for error reporting
166
+ candidates.extend(candidates_no_lut)
167
+
168
+ # If we get here, no valid path was found
169
+ print("\nError: Model not found. Tried the following paths:")
170
+ for candidate in candidates:
171
+ print(f" {candidate}")
172
+ raise FileNotFoundError(f"Model not found: {path}")
173
+
174
+ def parse_ffn_filename(path):
175
+ """Parse FFN model filename to extract chunk information."""
176
+ path = Path(path)
177
+ pattern = r'FFN_PF.*_chunk_(\d+)of(\d+)'
178
+ match = re.search(pattern, path.name)
179
+
180
+ if match:
181
+ current_chunk = int(match.group(1))
182
+ total_chunks = int(match.group(2))
183
+ return current_chunk, total_chunks
184
+ return None, None
185
+
186
+ def find_all_chunks(base_path):
187
+ """Find all chunk files matching the base FFN path pattern."""
188
+ path = Path(base_path)
189
+ pattern = re.sub(r'_chunk_\d+of\d+', '_chunk_*', str(path))
190
+ return sorted(glob.glob(pattern))
191
+
192
+ def load_model(path, function_name=None):
193
+ """Load a CoreML model, handling both .mlmodelc and .mlpackage formats."""
194
+ path = Path(path)
195
+ compute_unit = ct.ComputeUnit.CPU_AND_NE
196
+
197
+ try:
198
+ if path.suffix == '.mlmodelc':
199
+ # For compiled models (.mlmodelc), use CompiledMLModel
200
+ if function_name:
201
+ return ct.models.CompiledMLModel(str(path), compute_unit, function_name=function_name)
202
+ else:
203
+ return ct.models.CompiledMLModel(str(path), compute_unit)
204
+ else:
205
+ # For packages (.mlpackage)
206
+ if function_name:
207
+ return ct.models.MLModel(str(path), function_name=function_name)
208
+ else:
209
+ return ct.models.MLModel(str(path))
210
+
211
+ except RuntimeError as e:
212
+ if "valid manifest does not exist" in str(e):
213
+ print(f"\nError: Could not load compiled model at {path}")
214
+ print("This might be because:")
215
+ print("1. The model is not properly compiled")
216
+ print("2. The model was compiled for a different OS version")
217
+ print("3. The model needs to be recompiled")
218
+ print("\nTry using the .mlpackage version instead, or recompile the model.")
219
+ raise
220
+
221
+ def parse_args():
222
+ parser = argparse.ArgumentParser(description='Full Chat with CoreML LLaMA with context window shifting, gil resolved (c) 2025 Anemll')
223
+
224
+ # Add meta.yaml option
225
+ parser.add_argument('--meta', type=str, help='Path to meta.yaml to load all parameters')
226
+
227
+ # Add existing arguments
228
+ parser.add_argument('--d', '--dir', type=str, default='.',
229
+ help='Directory containing model files (default: current directory)')
230
+ parser.add_argument('--embed', type=str, required=False,
231
+ help='Path to embeddings model (relative to --dir)')
232
+ parser.add_argument('--ffn', type=str, required=False,
233
+ help='Path to FFN model (can be chunked, relative to --dir)')
234
+ parser.add_argument('--lmhead', type=str, required=False,
235
+ help='Path to LM head model (relative to --dir)')
236
+ parser.add_argument('--tokenizer', type=str, required=False,
237
+ help='Path to tokenizer')
238
+
239
+ # Add new argument for auto-generation
240
+ parser.add_argument('--prompt', type=str,
241
+ help='If specified, run once with this prompt and exit')
242
+
243
+ # Add no-warmup flag
244
+ parser.add_argument('--nw', action='store_true',
245
+ help='Skip warmup phase')
246
+
247
+ # Add debug level
248
+ parser.add_argument('--debug-level', type=int, default=0,
249
+ help='Debug level (0=none, 1=print prompts, 2=more verbose)')
250
+
251
+ # Model configuration
252
+ parser.add_argument('--context-length', type=int,
253
+ help='Context length for the model (default: 512), if not provided, it will be detected from the model directory name ctxNUMBER')
254
+ parser.add_argument('--batch-size', type=int,
255
+ help='Batch size for prefill (default: 64)')
256
+
257
+ args = parser.parse_args()
258
+
259
+ # If meta.yaml is provided, load parameters from it
260
+ if args.meta:
261
+ try:
262
+ with open(args.meta, 'r') as f:
263
+ meta = yaml.safe_load(f)
264
+ params = meta['model_info']['parameters']
265
+
266
+ # Set model directory to meta.yaml directory if not specified
267
+ if not args.d or args.d == '.':
268
+ args.d = str(Path(args.meta).parent)
269
+
270
+ # Build model paths based on parameters
271
+ prefix = params.get('model_prefix', 'llama') # Default to 'llama' if not specified
272
+ lut_ffn = f"_lut{params['lut_ffn']}" if params['lut_ffn'] != 'none' else ''
273
+ lut_lmhead = f"_lut{params['lut_lmhead']}" if params['lut_lmhead'] != 'none' else ''
274
+ lut_embeddings = f"_lut{params['lut_embeddings']}" if params['lut_embeddings'] != 'none' else ''
275
+ num_chunks = int(params['num_chunks'])
276
+
277
+ # Set model paths if not specified
278
+ if not args.lmhead:
279
+ args.lmhead = f'{prefix}_lm_head{lut_lmhead}'
280
+ if not args.embed:
281
+ args.embed = f'{prefix}_embeddings{lut_embeddings}' # Changed from lm_head to embeddings
282
+ if not args.ffn:
283
+ args.ffn = f'{prefix}_FFN_PF{lut_ffn}_chunk_01of{num_chunks:02d}'
284
+ if not args.tokenizer:
285
+ args.tokenizer = args.d
286
+
287
+ # Set other parameters if not overridden by command line
288
+ if args.context_length is None:
289
+ args.context_length = int(params['context_length'])
290
+ if args.batch_size is None:
291
+ args.batch_size = int(params['batch_size'])
292
+ args.num_chunks = num_chunks
293
+
294
+ # Parse split_lm_head parameter from meta.yaml
295
+ if 'split_lm_head' in params:
296
+ args.split_lm_head = int(params['split_lm_head'])
297
+ else:
298
+ args.split_lm_head = 8 # Default value
299
+
300
+ print(f"\nLoaded parameters from {args.meta}:")
301
+ print(f" Context Length: {args.context_length}")
302
+ print(f" Batch Size: {args.batch_size}")
303
+ print(f" Num Chunks: {args.num_chunks}")
304
+ print(f" Split LM Head: {args.split_lm_head}")
305
+ print(f" Models Directory: {args.d}")
306
+ print(f" Embeddings: {args.embed}")
307
+ print(f" LM Head: {args.lmhead}")
308
+ print(f" FFN: {args.ffn}")
309
+
310
+ except Exception as e:
311
+ print(f"\nError loading meta.yaml: {str(e)}")
312
+ sys.exit(1)
313
+
314
+ return args
315
+
316
+ def load_metadata(model,args):
317
+ # Extract metadata and config parameters
318
+ metadata = {}
319
+ if hasattr(model, 'user_defined_metadata'):
320
+ meta = model.user_defined_metadata
321
+
322
+ # Extract key parameters with defaults
323
+ metadata['context_length'] = int(meta.get('com.anemll.context_length', 512))
324
+ metadata['state_length'] = int(meta.get('com.anemll.state_length', metadata['context_length'])) # Added state_length
325
+ metadata['batch_size'] = int(meta.get('com.anemll.batch_size', 64))
326
+ metadata['lut_bits'] = int(meta.get('com.anemll.lut_bits', 0))
327
+ metadata['num_chunks'] = int(meta.get('com.anemll.num_chunks', 1))
328
+
329
+ print("\nExtracted Parameters:")
330
+ print(f" Context Length: {metadata['context_length']}")
331
+ print(f" State Length: {metadata['state_length']}")
332
+ print(f" Prefill Batch Size: {metadata['batch_size']}")
333
+ print(f" LUT Bits: {metadata['lut_bits']}")
334
+ print(f" Number of Chunks: {metadata['num_chunks']}")
335
+
336
+ # Print model info
337
+ print("\nModel Info:")
338
+ if 'com.anemll.info' in meta:
339
+ print(f" {meta['com.anemll.info']}")
340
+ if 'com.github.apple.coremltools.version' in meta:
341
+ print(f" CoreML Tools: {meta['com.github.apple.coremltools.version']}")
342
+
343
+ # Print model input/output shapes
344
+ print("\nModel Shapes:")
345
+ if hasattr(model, 'input_description'):
346
+ print(" Inputs:")
347
+ try:
348
+ if hasattr(model.input_description, 'items'):
349
+ for name, desc in model.input_description.items():
350
+ print(f" {name}: {desc}")
351
+ else:
352
+ print(f" {model.input_description}")
353
+ except:
354
+ print(f" Input description: {type(model.input_description)}")
355
+ if hasattr(model, 'output_description'):
356
+ print(" Outputs:")
357
+ try:
358
+ if hasattr(model.output_description, 'items'):
359
+ for name, desc in model.output_description.items():
360
+ print(f" {name}: {desc}")
361
+ else:
362
+ print(f" {model.output_description}")
363
+ except:
364
+ print(f" Output description: {type(model.output_description)}")
365
+ else:
366
+ print("\nWarning: No metadata found in model")
367
+
368
+ # Check if model directory name contains context length pattern (ctxXXX)
369
+ ctx_len = 512
370
+ if args.context_length is None:
371
+ import re
372
+ ctx_match = re.search(r'ctx(\d+)', str(args.d))
373
+ if ctx_match:
374
+ ctx_len0 = int(ctx_match.group(1))
375
+ if 512 <= ctx_len0 <= 8096:
376
+ ctx_len = ctx_len0
377
+ print(f"\nDetected context length {ctx_len} from directory name")
378
+ else:
379
+ print(f"\nWarning: No context length (ctxNUMBER) found in directory name {args.d}, using default {ctx_len}")
380
+ else:
381
+ ctx_len = args.context_length
382
+
383
+ # Use defaults or values from args
384
+ metadata['context_length'] = ctx_len
385
+ metadata['state_length'] = ctx_len
386
+ # Get batch size from args or use default
387
+ metadata['batch_size'] = getattr(args, 'batch_size', 64)
388
+ metadata['lut_bits'] = 4
389
+ metadata['num_chunks'] = getattr(args, 'num_chunks', 4)
390
+ print("\nUsing parameters:")
391
+ print(f" Context Length: {metadata['context_length']}")
392
+ print(f" State Length: {metadata['state_length']}")
393
+ print(f" Prefill Batch Size: {metadata['batch_size']}")
394
+ print(f" LUT Bits: {metadata['lut_bits']}")
395
+ print(f" Number of Chunks: {metadata['num_chunks']}")
396
+
397
+ # Override with values from args if they exist
398
+ if hasattr(args, 'batch_size') and args.batch_size is not None:
399
+ metadata['batch_size'] = args.batch_size
400
+ print(f"\nOverriding batch size from args: {args.batch_size}")
401
+ if hasattr(args, 'num_chunks') and args.num_chunks is not None:
402
+ metadata['num_chunks'] = args.num_chunks
403
+ print(f"\nOverriding num chunks from args: {args.num_chunks}")
404
+
405
+ return metadata
406
+
407
+ def load_models(args,metadata):
408
+ """Load all required models and extract metadata."""
409
+ print("\nLoading models...")
410
+
411
+ try:
412
+ # Load embeddings model
413
+ print("\nLoading embeddings model...")
414
+ embed_path = parse_model_path(args.embed)
415
+ print(f"Loading from: {embed_path}")
416
+ embed_model = load_model(embed_path)
417
+ print("Embeddings model loaded successfully")
418
+ metadata = load_metadata(embed_model,args)
419
+
420
+
421
+
422
+ # Load LM head model
423
+ print("\nLoading LM head model...")
424
+ lmhead_path = parse_model_path(args.lmhead)
425
+ print(f"Loading from: {lmhead_path}")
426
+ lmhead_model = load_model(lmhead_path)
427
+ print("LM head model loaded successfully")
428
+
429
+ # Parse FFN path and find chunks if needed
430
+ print("\nLoading FFN+PREFILL model(s)...")
431
+ ffn_path = parse_model_path(args.ffn)
432
+ chunk_no, total_chunks = parse_ffn_filename(ffn_path)
433
+
434
+ ffn_models = []
435
+ if chunk_no and total_chunks:
436
+ print(f"\nDetected chunked FFN+PREFILL model ({total_chunks} chunks)")
437
+ # Find and load all chunks
438
+ chunk_paths = find_all_chunks(ffn_path)
439
+ if len(chunk_paths) != total_chunks:
440
+ raise ValueError(f"Found {len(chunk_paths)} chunks but filename indicates {total_chunks} chunks")
441
+
442
+ for chunk_path in chunk_paths:
443
+ print(f"\nLoading FFN+PREFILL chunk: {Path(chunk_path).name}")
444
+ try:
445
+ # For chunked models, we need both infer and prefill functions
446
+ ffn_models.append({
447
+ 'infer': load_model(chunk_path, function_name='infer'),
448
+ 'prefill': load_model(chunk_path, function_name='prefill')
449
+ })
450
+ print("Chunk loaded successfully")
451
+ except Exception as e:
452
+ print(f"Error loading chunk {chunk_path}: {str(e)}")
453
+ raise
454
+ metadata = load_metadata(ffn_models[0],args)
455
+
456
+ else:
457
+ print("\nLoading single FFN model...")
458
+ ffn_models.append(load_model(ffn_path))
459
+ print("FFN model loaded successfully")
460
+
461
+ return embed_model, ffn_models, lmhead_model, metadata
462
+
463
+ except Exception as e:
464
+ print(f"\nError loading models: {str(e)}")
465
+ print("\nPlease ensure all model files exist and are accessible.")
466
+ print("Expected files:")
467
+ print(f" Embeddings: {args.embed}")
468
+ print(f" LM Head: {args.lmhead}")
469
+ print(f" FFN: {args.ffn}")
470
+ raise
471
+
472
+ # At the top of the file, make this a default path
473
+
474
+ def initialize_tokenizer(model_path=None):
475
+ """Initialize and configure the tokenizer."""
476
+ try:
477
+
478
+
479
+ tokenizer = AutoTokenizer.from_pretrained(
480
+ str(model_path),
481
+ use_fast=False,
482
+ trust_remote_code=True
483
+ )
484
+
485
+ print("\nTokenizer Configuration:")
486
+ print(f"Tokenizer type: {type(tokenizer)}")
487
+ print(f"Tokenizer name: {tokenizer.__class__.__name__}")
488
+ print(f"Vocabulary size: {len(tokenizer)}")
489
+ print(f"Model max length: {tokenizer.model_max_length}")
490
+
491
+ if tokenizer.pad_token is None:
492
+ tokenizer.pad_token = tokenizer.eos_token
493
+ tokenizer.pad_token_id = tokenizer.eos_token_id
494
+ print("Set PAD token to EOS token")
495
+
496
+ tokenizer.padding_side = "left"
497
+
498
+ print(f"\nSpecial Tokens:")
499
+ print(f"PAD token: '{tokenizer.pad_token}' (ID: {tokenizer.pad_token_id})")
500
+ print(f"EOS token: '{tokenizer.eos_token}' (ID: {tokenizer.eos_token_id})")
501
+ print(f"BOS token: '{tokenizer.bos_token}' (ID: {tokenizer.bos_token_id})")
502
+ print(f"UNK token: '{tokenizer.unk_token}' (ID: {tokenizer.unk_token_id})")
503
+
504
+ return tokenizer
505
+
506
+ except Exception as e:
507
+ print(f"\nError: Failed to load tokenizer from {model_path}")
508
+ print(f"Error details: {str(e)}")
509
+ print(f"Error type: {type(e)}")
510
+ print("\nThis code requires a Llama 3.2 model for chat template functionality.")
511
+ print("Please provide the path to a Llama 3.2 model directory.")
512
+ import traceback
513
+ traceback.print_exc()
514
+ raise
515
+
516
+
517
+
518
+ def make_causal_mask(length, start):
519
+ """Create causal attention mask."""
520
+ mask = np.full((1, 1, length, length), -np.inf, dtype=np.float16)
521
+ row_indices = np.arange(length).reshape(length, 1)
522
+ col_indices = np.arange(length).reshape(1, length)
523
+ mask[:, :, col_indices <= (row_indices + start)] = 0
524
+ return mask
525
+
526
+ def run_prefill(embed_model, ffn_models, input_ids, current_pos, context_length, batch_size, state, causal_mask):
527
+ """Run prefill on the input sequence."""
528
+ #print(f"[DEBUG] Running prefill from 0 to {current_pos}")
529
+
530
+ # Process in batches
531
+ batch_pos = 0
532
+ while batch_pos < current_pos:
533
+ batch_end = min(batch_pos + batch_size, current_pos)
534
+ current_batch_size = batch_end - batch_pos
535
+
536
+ #print(f"[DEBUG] Prefill batch {batch_pos}-{batch_end} (size={current_batch_size})")
537
+
538
+ # Get current batch
539
+ batch_input = input_ids[:, batch_pos:batch_end]
540
+
541
+ # Pad to full batch size
542
+ batch_input = F.pad(
543
+ batch_input,
544
+ (0, batch_size - current_batch_size),
545
+ value=0
546
+ )
547
+
548
+ # Generate position IDs for this batch
549
+ position_ids = torch.arange(batch_pos, batch_pos + batch_size, dtype=torch.int32)
550
+
551
+ # Use the pre-initialized causal mask and extract the batch portion
552
+ batch_causal_mask = causal_mask[:, :, batch_pos:batch_pos + batch_size, :]
553
+
554
+ # Run embeddings
555
+ hidden_states = torch.from_numpy(
556
+ embed_model.predict({'input_ids': batch_input.numpy()})['hidden_states']
557
+ )
558
+
559
+ # Run through FFN chunks
560
+ for ffn_model in ffn_models:
561
+ if isinstance(ffn_model, dict):
562
+ inputs = {
563
+ 'hidden_states': hidden_states.numpy(),
564
+ 'position_ids': position_ids.numpy(),
565
+ 'causal_mask': batch_causal_mask.numpy(),
566
+ 'current_pos': np.array([batch_pos], dtype=np.int32)
567
+ }
568
+ output = ffn_model['prefill'].predict(inputs, state)
569
+ hidden_states = torch.from_numpy(output['output_hidden_states'])
570
+
571
+ batch_pos = batch_end
572
+
573
+ return torch.tensor([current_pos], dtype=torch.int32)
574
+
575
+ def generate_next_token(embed_model, ffn_models, lmhead_model, input_ids, pos, context_length, state, causal_mask, metadata=None, temperature=0.0):
576
+ """Generate the next token."""
577
+ # Get current token
578
+ current_token = input_ids[:, pos-1:pos]
579
+
580
+ # Run embeddings
581
+ hidden_states = torch.from_numpy(
582
+ embed_model.predict({'input_ids': current_token.numpy()})['hidden_states']
583
+ )
584
+
585
+ # Create masks
586
+ update_mask = torch.zeros((1, 1, context_length, 1), dtype=torch.float16)
587
+ update_mask[0, 0, pos-1, 0] = 1.0
588
+ position_ids = torch.tensor([pos-1], dtype=torch.int32)
589
+
590
+ # Use the pre-initialized causal mask and extract the single position portion
591
+ single_causal_mask = causal_mask[:, :, pos-1:pos, :]
592
+
593
+ # Run through FFN chunks
594
+ for ffn_model in ffn_models:
595
+ if isinstance(ffn_model, dict):
596
+ inputs = {
597
+ 'hidden_states': hidden_states.numpy(),
598
+ 'update_mask': update_mask.numpy(),
599
+ 'position_ids': position_ids.numpy(),
600
+ 'causal_mask': single_causal_mask.numpy(),
601
+ 'current_pos': position_ids.numpy()
602
+ }
603
+ output = ffn_model['infer'].predict(inputs, state)
604
+ hidden_states = torch.from_numpy(output['output_hidden_states'])
605
+
606
+ # Run LM head and get next token
607
+ lm_output = lmhead_model.predict({'hidden_states': hidden_states.numpy()})
608
+
609
+ if 'logits1' in lm_output:
610
+ logits_parts = []
611
+ for i in range(1, metadata.get('split_lm_head', 8) + 1):
612
+ key = f'logits{i}'
613
+ if key in lm_output:
614
+ logits_parts.append(torch.from_numpy(lm_output[key]))
615
+ logits = torch.cat(logits_parts, dim=-1)
616
+ else:
617
+ logits = torch.from_numpy(lm_output['output_logits'])
618
+
619
+ if temperature > 0:
620
+ logits = logits / temperature
621
+ probs = F.softmax(logits[0, -1, :], dim=-1)
622
+ next_token = torch.multinomial(probs, num_samples=1).item()
623
+ else:
624
+ next_token = torch.argmax(logits[0, -1, :]).item()
625
+
626
+ return next_token
627
+
628
+ def create_unified_state(ffn_models, context_length):
629
+ """Create unified KV cache state for transformer."""
630
+ if isinstance(ffn_models[0], dict):
631
+ # Use first FFN model's prefill function to create state
632
+ state = ffn_models[0]['prefill'].make_state()
633
+ print(f"\nCreated unified transformer state for {len(ffn_models)} chunks")
634
+ return state
635
+ else:
636
+ state = ffn_models[0].make_state()
637
+ print("\nCreated unified transformer state")
638
+ return state
639
+
640
+ def initialize_causal_mask(context_length):
641
+ """Initialize causal mask for transformer attention."""
642
+ causal_mask = make_causal_mask(context_length, 0)
643
+ causal_mask = torch.tensor(causal_mask, dtype=torch.float16)
644
+ print(f"\nInitialized causal mask for context length {context_length}")
645
+ return causal_mask
646
+
647
+ def get_user_input():
648
+ """Get input from user, handling special key combinations."""
649
+ global THINKING_MODE
650
+ try:
651
+ import termios
652
+ import tty
653
+ import sys
654
+
655
+ def _getch():
656
+ fd = sys.stdin.fileno()
657
+ old_settings = termios.tcgetattr(fd)
658
+ try:
659
+ tty.setraw(sys.stdin.fileno())
660
+ ch = sys.stdin.read(1)
661
+ finally:
662
+ termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
663
+ return ch
664
+
665
+ buffer = []
666
+ while True:
667
+ char = _getch()
668
+
669
+ # Debug: print the character code
670
+ print(f"\nKey pressed: {repr(char)} (hex: {hex(ord(char))})")
671
+
672
+ # Check for Enter key
673
+ if char == '\r' or char == '\n':
674
+ print() # Move to next line
675
+ input_text = ''.join(buffer)
676
+ # Check if the command is /t
677
+ if input_text == '/t':
678
+ THINKING_MODE = not THINKING_MODE
679
+ print(f"Thinking mode {'ON' if THINKING_MODE else 'OFF'}")
680
+ buffer = [] # Clear buffer
681
+ print(f"\n{LIGHT_GREEN}You{' (thinking)' if THINKING_MODE else ''}:{RESET_COLOR}", end=' ', flush=True)
682
+ continue
683
+ return input_text
684
+
685
+ # Handle backspace
686
+ if char == '\x7f': # backspace
687
+ if buffer:
688
+ buffer.pop()
689
+ sys.stdout.write('\b \b') # Erase character
690
+ sys.stdout.flush()
691
+ continue
692
+
693
+ # Handle Ctrl-C
694
+ if char == '\x03': # Ctrl-C
695
+ print("^C")
696
+ raise KeyboardInterrupt
697
+
698
+ # Print character and add to buffer
699
+ sys.stdout.write(char)
700
+ sys.stdout.flush()
701
+ buffer.append(char)
702
+
703
+ except ImportError:
704
+ # Fallback for systems without termios
705
+ return input("> ")
706
+
707
+ def chat_loop(embed_model, ffn_models, lmhead_model, tokenizer, metadata, state, causal_mask, auto_prompt=None, warmup=False):
708
+ """Interactive chat loop."""
709
+ global THINKING_MODE
710
+ global DEBUG_LEVEL
711
+ context_length = metadata.get('context_length')
712
+ batch_size = metadata.get('batch_size', 64)
713
+
714
+ if not warmup:
715
+ print(f"\nUsing context length: {context_length}")
716
+ print("\nStarting chat session. Press Ctrl+D to exit.")
717
+ print("Type your message and press Enter to chat. Use /t to toggle thinking mode.")
718
+ print(f"Thinking mode is {'ON' if THINKING_MODE else 'OFF'}")
719
+
720
+ # Keep track of conversation history
721
+ conversation = []
722
+
723
+ try:
724
+ while True:
725
+ try:
726
+ if not warmup:
727
+ print(f"\n{LIGHT_GREEN}You{' (thinking)' if THINKING_MODE else ''}:{RESET_COLOR}", end=' ', flush=True)
728
+ if auto_prompt is not None:
729
+ user_input = auto_prompt
730
+ if not warmup:
731
+ print(user_input)
732
+ else:
733
+ user_input = input().strip()
734
+ except EOFError:
735
+ if not warmup:
736
+ print("\nExiting chat...")
737
+ break
738
+
739
+ if not user_input:
740
+ continue
741
+
742
+ # Handle /t command
743
+ if user_input == "/t":
744
+ THINKING_MODE = not THINKING_MODE
745
+ print(f"Thinking mode {'ON' if THINKING_MODE else 'OFF'}")
746
+ continue
747
+
748
+ # Add user message to conversation
749
+ conversation.append({"role": "user", "content": user_input})
750
+
751
+ # Format using chat template with full history
752
+ if THINKING_MODE:
753
+ # Add thinking prompt to system message
754
+ conversation_with_thinking = [{"role": "system", "content": THINKING_PROMPT}] + conversation
755
+ base_input_ids = tokenizer.apply_chat_template(
756
+ conversation_with_thinking,
757
+ return_tensors="pt",
758
+ add_generation_prompt=True
759
+ ).to(torch.int32)
760
+
761
+ # Print full prompt if debug level >= 1
762
+ if DEBUG_LEVEL >= 1 and not warmup:
763
+ print(f"\n{DARK_BLUE}Debug: Full prompt with thinking:{RESET_COLOR}")
764
+ print(tokenizer.decode(base_input_ids[0]))
765
+ else:
766
+ base_input_ids = tokenizer.apply_chat_template(
767
+ conversation,
768
+ return_tensors="pt",
769
+ add_generation_prompt=True
770
+ ).to(torch.int32)
771
+
772
+ # Print full prompt if debug level >= 1
773
+ if DEBUG_LEVEL >= 1 and not warmup:
774
+ print(f"\n{DARK_BLUE}Debug: Full prompt:{RESET_COLOR}")
775
+ print(tokenizer.decode(base_input_ids[0]))
776
+
777
+ # Check if we need to trim history
778
+ while base_input_ids.size(1) > context_length - 100: # Leave room for response
779
+ # Remove oldest message pair (user + assistant)
780
+ if len(conversation) > 2:
781
+ conversation = conversation[2:] # Remove oldest pair
782
+ base_input_ids = tokenizer.apply_chat_template(
783
+ conversation,
784
+ return_tensors="pt",
785
+ add_generation_prompt=True
786
+ ).to(torch.int32)
787
+ else:
788
+ # If only current message remains and still too long, truncate
789
+ base_input_ids = base_input_ids[:, -context_length//2:]
790
+ break
791
+
792
+ context_pos = base_input_ids.size(1)
793
+
794
+ # Pad sequence to context_size
795
+ input_ids = F.pad(
796
+ base_input_ids,
797
+ (0, context_length - context_pos),
798
+ value=0
799
+ )
800
+
801
+ if not warmup:
802
+ print(f"\n{LIGHT_BLUE}Assistant:{RESET_COLOR}", end=' ', flush=True)
803
+
804
+ # split_lm_head should already be in metadata from caller
805
+
806
+ # Initialize token printer and collect response
807
+ token_printer = TokenPrinter(tokenizer)
808
+ response_tokens = []
809
+ generation_start_time = time.time()
810
+
811
+ try:
812
+ # Run prefill on entire context
813
+ current_pos = run_prefill(
814
+ embed_model,
815
+ ffn_models,
816
+ input_ids,
817
+ context_pos,
818
+ context_length,
819
+ batch_size,
820
+ state,
821
+ causal_mask
822
+ )
823
+ #print(f"\n[DEBUG] After initial prefill - current_pos: {current_pos}")
824
+
825
+ # Generation loop
826
+ pos = context_pos
827
+ tokens_generated = 0
828
+ inference_start = time.time() # Start inference timing
829
+
830
+ while True:
831
+ # Check if we need to shift window
832
+ if pos >= context_length - 2:
833
+ # Calculate shift to maintain full batches
834
+ batch_size = metadata.get('batch_size', 64)
835
+ # Calculate max batches that fit in context
836
+ max_batches = context_length // batch_size
837
+ desired_batches = max(1, max_batches - 2) # Leave room for new tokens
838
+ new_size = min(desired_batches * batch_size, context_length - batch_size)
839
+
840
+ # Create shifted input_ids
841
+ tmp = torch.zeros((1, context_length), dtype=torch.int32)
842
+ tmp[:,0:new_size] = input_ids[:,pos-new_size:pos]
843
+ input_ids = tmp
844
+
845
+ # Reset state and run prefill
846
+ # keep the same state
847
+ #state = create_unified_state(ffn_models, context_length)
848
+ current_pos = run_prefill(
849
+ embed_model,
850
+ ffn_models,
851
+ input_ids,
852
+ new_size, # Prefill the entire shifted content
853
+ context_length,
854
+ batch_size,
855
+ state,
856
+ causal_mask
857
+ )
858
+
859
+ # Start generating from the next position
860
+ pos = new_size # Don't back up, continue from where we left off
861
+
862
+ #print(f"\n[DEBUG] After shift - next token will be at pos {pos}")
863
+ #print(f"[DEBUG] Context before next token: {tokenizer.decode(input_ids[0, pos-40:pos])}")
864
+
865
+ window_shifted = True
866
+
867
+ # Generate next token
868
+ next_token = generate_next_token(
869
+ embed_model,
870
+ ffn_models,
871
+ lmhead_model,
872
+ input_ids,
873
+ pos,
874
+ context_length,
875
+ state,
876
+ causal_mask,
877
+ metadata
878
+ )
879
+
880
+ # Add token
881
+ input_ids[0, pos] = next_token
882
+ if not warmup:
883
+ token_printer.add_token(next_token)
884
+ token_printer.drain_buffer()
885
+ response_tokens.append(next_token)
886
+
887
+ pos += 1
888
+ tokens_generated += 1
889
+
890
+ # In warmup mode, limit tokens
891
+ if warmup and tokens_generated >= WARMUP_TOKEN_LIMIT:
892
+ break
893
+
894
+ if next_token == tokenizer.eos_token_id:
895
+ break
896
+
897
+ inference_time = time.time() - inference_start # Calculate inference time
898
+
899
+ # Add assistant response to conversation
900
+ response_text = token_printer.stop()
901
+ conversation.append({"role": "assistant", "content": response_text})
902
+
903
+ # Print stats only if not in warmup
904
+ if not warmup:
905
+ total_time = time.time() - generation_start_time
906
+ prefill_time = total_time - inference_time
907
+ inference_tokens_per_sec = len(response_tokens) / inference_time if inference_time > 0 else 0
908
+ prefill_ms = prefill_time * 1000
909
+ prefill_tokens_per_sec = context_pos / prefill_time if prefill_time > 0 else 0
910
+ print(f"{DARK_BLUE}{inference_tokens_per_sec:.1f} t/s, "
911
+ f"TTFT: {prefill_ms:.1f}ms ({prefill_tokens_per_sec:.1f} t/s), "
912
+ f"{len(response_tokens)} tokens{RESET_COLOR}")
913
+
914
+ if auto_prompt is not None:
915
+ break
916
+
917
+ except KeyboardInterrupt:
918
+ if not warmup:
919
+ print("\nGeneration interrupted")
920
+ token_printer.stop()
921
+ continue
922
+
923
+ except Exception as e:
924
+ if not warmup:
925
+ print(f"\nError in chat loop: {str(e)}")
926
+ import traceback
927
+ traceback.print_exc()
928
+
929
+ def main():
930
+ args = parse_args()
931
+ global DEBUG_LEVEL
932
+ DEBUG_LEVEL = args.debug_level
933
+
934
+ # Convert directory to absolute path
935
+ model_dir = Path(args.d).resolve()
936
+ if not model_dir.exists():
937
+ print(f"\nError: Model directory not found: {model_dir}")
938
+ return 1
939
+
940
+ print(f"\nUsing model directory: {model_dir}")
941
+ print(f"Context length: {args.context_length}")
942
+
943
+ try:
944
+ # Update paths to be relative to model directory
945
+ args.embed = str(model_dir / args.embed)
946
+ args.ffn = str(model_dir / args.ffn)
947
+ args.lmhead = str(model_dir / args.lmhead)
948
+
949
+ # Handle tokenizer path separately since it's not relative to model_dir
950
+ if args.tokenizer is None:
951
+ args.tokenizer = str(model_dir)
952
+
953
+ if not Path(args.tokenizer).exists():
954
+ print(f"\nError: Tokenizer directory not found: {args.tokenizer}")
955
+ return 1
956
+
957
+ args.tokenizer = str(Path(args.tokenizer).resolve()) # Convert to absolute path
958
+ print(f"Using tokenizer path: {args.tokenizer}")
959
+
960
+ metadata = {}
961
+ # Load models and extract metadata
962
+ embed_model, ffn_models, lmhead_model, metadata = load_models(args,metadata)
963
+
964
+ print(f"\nMetadata before args.context_length override: {metadata}")
965
+
966
+ # Override context length from command line if provided
967
+ if args.context_length is not None:
968
+ metadata['context_length'] = args.context_length
969
+ metadata['state_length'] = args.context_length # Also update state_length
970
+ print(f"\nOverriding context length from command line: {args.context_length}")
971
+
972
+ print(f"\nMetadata after load_models: {metadata}")
973
+
974
+ # Load tokenizer with resolved path
975
+ tokenizer = initialize_tokenizer(args.tokenizer)
976
+ if tokenizer is None:
977
+ raise RuntimeError("Failed to initialize tokenizer")
978
+
979
+ # Create unified state once
980
+ state = create_unified_state(ffn_models, metadata['context_length'])
981
+
982
+ # Initialize causal mask once
983
+ causal_mask = initialize_causal_mask(metadata['context_length'])
984
+
985
+ # Add split_lm_head to metadata for generate_next_token
986
+ metadata['split_lm_head'] = getattr(args, 'split_lm_head', 8)
987
+
988
+ # Warmup runs to prevent Python GIL issues with CoreML
989
+ if not args.nw:
990
+ for i in range(2):
991
+ chat_loop(
992
+ embed_model=embed_model,
993
+ ffn_models=ffn_models,
994
+ lmhead_model=lmhead_model,
995
+ tokenizer=tokenizer,
996
+ metadata=metadata,
997
+ state=state, # Pass the state
998
+ causal_mask=causal_mask, # Pass the causal mask
999
+ warmup=True,
1000
+ auto_prompt="who are you?"
1001
+ )
1002
+
1003
+ # Main run
1004
+ chat_loop(
1005
+ embed_model=embed_model,
1006
+ ffn_models=ffn_models,
1007
+ lmhead_model=lmhead_model,
1008
+ tokenizer=tokenizer,
1009
+ metadata=metadata,
1010
+ state=state, # Pass the state
1011
+ causal_mask=causal_mask, # Pass the causal mask
1012
+ warmup=False,
1013
+ auto_prompt=args.prompt
1014
+ )
1015
+
1016
+ except Exception as e:
1017
+ print(f"\nError: {str(e)}")
1018
+ import traceback
1019
+ traceback.print_exc()
1020
+ return 1
1021
+
1022
+ return 0
1023
+
1024
+ if __name__ == "__main__":
1025
+ exit(main())
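For reference: with the meta.yaml included in this folder, the script above (chat_full.py, referenced in the README) can be run as `python chat_full.py --meta ./meta.yaml` for an interactive session, or with `--prompt "..."` to generate a single response and exit; `--nw` skips the warmup passes.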
config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "tokenizer_class": "LlamaTokenizer",
3
+ "model_type": "llama"
4
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
meta.yaml ADDED
@@ -0,0 +1,25 @@
1
+ model_info:
2
+ name: anemll-Qwen3-1.7B-MLX-dequantized-ctx1024
3
+ version: 0.3.3
4
+ description: |
5
+ Demonstrates running Qwen3-1.7B-MLX-dequantized on Apple Neural Engine
6
+ Context length: 1024
7
+ Batch size: 64
8
+ Chunks: 1
9
+ license: MIT
10
+ author: Anemll
11
+ framework: Core ML
12
+ language: Python
13
+ architecture: qwen3
14
+ parameters:
15
+ context_length: 1024
16
+ batch_size: 64
17
+ lut_embeddings: none
18
+ lut_ffn: 6
19
+ lut_lmhead: 8
20
+ num_chunks: 1
21
+ model_prefix: qwen
22
+ embeddings: qwen_embeddings_lut8.mlmodelc
23
+ lm_head: qwen_lm_head_lut8.mlmodelc
24
+ ffn: qwen_FFN_PF_lut6.mlmodelc
25
+ split_lm_head: 16
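As a rough sketch of how the parameters above are consumed: `parse_args` in the script earlier builds the model file names from `model_prefix`, the `lut_*` values, and `num_chunks`. The snippet below mirrors that logic and is illustrative only; the comment shows what it resolves to for this meta.yaml.

```python
import yaml

# Illustrative sketch mirroring the --meta handling in parse_args above
with open("meta.yaml") as f:
    params = yaml.safe_load(f)["model_info"]["parameters"]

prefix = params.get("model_prefix", "llama")   # "qwen" for this model
lut_ffn = f"_lut{params['lut_ffn']}" if params["lut_ffn"] != "none" else ""
num_chunks = int(params["num_chunks"])

ffn_name = f"{prefix}_FFN_PF{lut_ffn}_chunk_01of{num_chunks:02d}"
print(ffn_name)  # qwen_FFN_PF_lut6_chunk_01of01
```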
qwen_FFN_PF_lut6_chunk_01of01.mlmodelc/analytics/coremldata.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3a3119e44cd1884a02b6a9574f761f2a040b8a82915f94a4e3529ec55575ad3f
3
+ size 243
qwen_FFN_PF_lut6_chunk_01of01.mlmodelc/coremldata.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b3fb73e1e3d6d12b4f47c120df83f34add75620e79ca3661872925ca76a60ffb
3
+ size 983
qwen_FFN_PF_lut6_chunk_01of01.mlmodelc/metadata.json ADDED
@@ -0,0 +1,324 @@
1
+ [
2
+ {
3
+ "metadataOutputVersion" : "3.0",
4
+ "userDefinedMetadata" : {
5
+ "com.github.apple.coremltools.version" : "8.3.0",
6
+ "com.github.apple.coremltools.source_dialect" : "TorchScript",
7
+ "com.github.apple.coremltools.source" : "torch==2.5.0",
8
+ "com.anemll.context_length" : "1024",
9
+ "com.anemll.lut_bits" : "6",
10
+ "com.anemll.num_chunks" : "1",
11
+ "com.anemll.batch_size" : "64",
12
+ "com.anemll.info" : "Converted with Anemll v0.3.3",
13
+ "com.anemll.chunk_no" : "1"
14
+ },
15
+ "availability" : {
16
+ "macOS" : "15.0",
17
+ "tvOS" : "18.0",
18
+ "visionOS" : "2.0",
19
+ "watchOS" : "11.0",
20
+ "iOS" : "18.0",
21
+ "macCatalyst" : "18.0"
22
+ },
23
+ "inputSchema" : [
24
+ {
25
+ "hasShapeFlexibility" : "0",
26
+ "isOptional" : "0",
27
+ "dataType" : "Float16",
28
+ "formattedType" : "MultiArray (Float16 1 × 1 × 2048)",
29
+ "shortDescription" : "",
30
+ "shape" : "[1, 1, 2048]",
31
+ "name" : "hidden_states",
32
+ "type" : "MultiArray"
33
+ },
34
+ {
35
+ "hasShapeFlexibility" : "0",
36
+ "isOptional" : "0",
37
+ "dataType" : "Int32",
38
+ "formattedType" : "MultiArray (Int32 1)",
39
+ "shortDescription" : "",
40
+ "shape" : "[1]",
41
+ "name" : "position_ids",
42
+ "type" : "MultiArray"
43
+ },
44
+ {
45
+ "hasShapeFlexibility" : "0",
46
+ "isOptional" : "0",
47
+ "dataType" : "Float16",
48
+ "formattedType" : "MultiArray (Float16 1 × 1 × 1 × 1024)",
49
+ "shortDescription" : "",
50
+ "shape" : "[1, 1, 1, 1024]",
51
+ "name" : "causal_mask",
52
+ "type" : "MultiArray"
53
+ },
54
+ {
55
+ "hasShapeFlexibility" : "0",
56
+ "isOptional" : "0",
57
+ "dataType" : "Int32",
58
+ "formattedType" : "MultiArray (Int32 1)",
59
+ "shortDescription" : "",
60
+ "shape" : "[1]",
61
+ "name" : "current_pos",
62
+ "type" : "MultiArray"
63
+ }
64
+ ],
65
+ "outputSchema" : [
66
+ {
67
+ "hasShapeFlexibility" : "0",
68
+ "isOptional" : "0",
69
+ "dataType" : "Float16",
70
+ "formattedType" : "MultiArray (Float16 1 × 1 × 2048)",
71
+ "shortDescription" : "",
72
+ "shape" : "[1, 1, 2048]",
73
+ "name" : "output_hidden_states",
74
+ "type" : "MultiArray"
75
+ }
76
+ ],
77
+ "modelParameters" : [
78
+
79
+ ],
80
+ "storagePrecision" : "Mixed (Float16, Palettized (13 bits), Palettized (14 bits), Palettized (16 bits), UInt6)",
81
+ "method" : "predict",
82
+ "functions" : [
83
+ {
84
+ "inputSchema" : [
85
+ {
86
+ "hasShapeFlexibility" : "0",
87
+ "isOptional" : "0",
88
+ "dataType" : "Float16",
89
+ "formattedType" : "MultiArray (Float16 1 × 1 × 2048)",
90
+ "shortDescription" : "",
91
+ "shape" : "[1, 1, 2048]",
92
+ "name" : "hidden_states",
93
+ "type" : "MultiArray"
94
+ },
95
+ {
96
+ "hasShapeFlexibility" : "0",
97
+ "isOptional" : "0",
98
+ "dataType" : "Int32",
99
+ "formattedType" : "MultiArray (Int32 1)",
100
+ "shortDescription" : "",
101
+ "shape" : "[1]",
102
+ "name" : "position_ids",
103
+ "type" : "MultiArray"
104
+ },
105
+ {
106
+ "hasShapeFlexibility" : "0",
107
+ "isOptional" : "0",
108
+ "dataType" : "Float16",
109
+ "formattedType" : "MultiArray (Float16 1 × 1 × 1 × 1024)",
110
+ "shortDescription" : "",
111
+ "shape" : "[1, 1, 1, 1024]",
112
+ "name" : "causal_mask",
113
+ "type" : "MultiArray"
114
+ },
115
+ {
116
+ "hasShapeFlexibility" : "0",
117
+ "isOptional" : "0",
118
+ "dataType" : "Int32",
119
+ "formattedType" : "MultiArray (Int32 1)",
120
+ "shortDescription" : "",
121
+ "shape" : "[1]",
122
+ "name" : "current_pos",
123
+ "type" : "MultiArray"
124
+ }
125
+ ],
126
+ "computePrecision" : "Mixed (Float16, Int32)",
127
+ "storagePrecision" : "Mixed (Float16, Palettized (13 bits), Palettized (14 bits), Palettized (16 bits), UInt6)",
128
+ "stateSchema" : [
129
+ {
130
+ "dataType" : "Float16",
131
+ "isOptional" : "0",
132
+ "formattedType" : "State (Float16 56 × 8 × 1024 × 128)",
133
+ "shortDescription" : "",
134
+ "shape" : "[56, 8, 1024, 128]",
135
+ "name" : "model_model_kv_cache_0",
136
+ "type" : "State"
137
+ }
138
+ ],
139
+ "outputSchema" : [
140
+ {
141
+ "hasShapeFlexibility" : "0",
142
+ "isOptional" : "0",
143
+ "dataType" : "Float16",
144
+ "formattedType" : "MultiArray (Float16 1 × 1 × 2048)",
145
+ "shortDescription" : "",
146
+ "shape" : "[1, 1, 2048]",
147
+ "name" : "output_hidden_states",
148
+ "type" : "MultiArray"
149
+ }
150
+ ],
151
+ "name" : "infer",
152
+ "mlProgramOperationTypeHistogram" : {
153
+ "Ios18.expandDims" : 112,
154
+ "Ios18.mul" : 224,
155
+ "Ios18.softmax" : 28,
156
+ "Ios18.matmul" : 56,
157
+ "Identity" : 1,
158
+ "Ios16.reduceMean" : 113,
159
+ "Ios18.greaterEqual" : 1,
160
+ "Select" : 1,
161
+ "Ios18.readState" : 57,
162
+ "Tile" : 56,
163
+ "Ios18.gather" : 2,
164
+ "Ios18.add" : 142,
165
+ "Ios18.layerNorm" : 113,
166
+ "Ios18.sliceUpdate" : 56,
167
+ "Ios18.writeState" : 56,
168
+ "Ios18.reshape" : 170,
169
+ "Ios18.constexprLutToDense" : 196,
170
+ "Ios18.conv" : 196,
171
+ "Ios18.concat" : 168,
172
+ "Ios18.transpose" : 168,
173
+ "Ios18.sub" : 113,
174
+ "Ios18.silu" : 28,
175
+ "Ios18.sliceByIndex" : 168,
176
+ "Ios18.squeeze" : 84
177
+ }
178
+ },
179
+ {
180
+ "inputSchema" : [
181
+ {
182
+ "hasShapeFlexibility" : "0",
183
+ "isOptional" : "0",
184
+ "dataType" : "Float16",
185
+ "formattedType" : "MultiArray (Float16 1 × 64 × 2048)",
186
+ "shortDescription" : "",
187
+ "shape" : "[1, 64, 2048]",
188
+ "name" : "hidden_states",
189
+ "type" : "MultiArray"
190
+ },
191
+ {
192
+ "hasShapeFlexibility" : "0",
193
+ "isOptional" : "0",
194
+ "dataType" : "Int32",
195
+ "formattedType" : "MultiArray (Int32 64)",
196
+ "shortDescription" : "",
197
+ "shape" : "[64]",
198
+ "name" : "position_ids",
199
+ "type" : "MultiArray"
200
+ },
201
+ {
202
+ "hasShapeFlexibility" : "0",
203
+ "isOptional" : "0",
204
+ "dataType" : "Float16",
205
+ "formattedType" : "MultiArray (Float16 1 × 1 × 64 × 1024)",
206
+ "shortDescription" : "",
207
+ "shape" : "[1, 1, 64, 1024]",
208
+ "name" : "causal_mask",
209
+ "type" : "MultiArray"
210
+ },
211
+ {
212
+ "hasShapeFlexibility" : "0",
213
+ "isOptional" : "0",
214
+ "dataType" : "Int32",
215
+ "formattedType" : "MultiArray (Int32 1)",
216
+ "shortDescription" : "",
217
+ "shape" : "[1]",
218
+ "name" : "current_pos",
219
+ "type" : "MultiArray"
220
+ }
221
+ ],
222
+ "computePrecision" : "Mixed (Float16, Int32)",
223
+ "storagePrecision" : "Mixed (Float16, Palettized (13 bits), Palettized (14 bits), Palettized (16 bits), UInt6)",
224
+ "stateSchema" : [
225
+ {
226
+ "dataType" : "Float16",
227
+ "isOptional" : "0",
228
+ "formattedType" : "State (Float16 56 × 8 × 1024 × 128)",
229
+ "shortDescription" : "",
230
+ "shape" : "[56, 8, 1024, 128]",
231
+ "name" : "model_model_kv_cache_0",
232
+ "type" : "State"
233
+ }
234
+ ],
235
+ "outputSchema" : [
236
+ {
237
+ "hasShapeFlexibility" : "0",
238
+ "isOptional" : "0",
239
+ "dataType" : "Float16",
240
+ "formattedType" : "MultiArray (Float16 1 × 1 × 2048)",
241
+ "shortDescription" : "",
242
+ "shape" : "[1, 1, 2048]",
243
+ "name" : "output_hidden_states",
244
+ "type" : "MultiArray"
245
+ }
246
+ ],
247
+ "name" : "prefill",
248
+ "mlProgramOperationTypeHistogram" : {
249
+ "Ios18.expandDims" : 112,
250
+ "Ios18.mul" : 224,
251
+ "Ios18.softmax" : 28,
252
+ "Ios18.matmul" : 56,
253
+ "Ios16.reduceMean" : 112,
254
+ "Ios18.greaterEqual" : 1,
255
+ "Select" : 1,
256
+ "Ios18.readState" : 57,
257
+ "Tile" : 56,
258
+ "Ios18.gather" : 2,
259
+ "Ios18.add" : 142,
260
+ "Ios18.layerNorm" : 112,
261
+ "Ios18.sliceUpdate" : 56,
262
+ "Ios18.writeState" : 56,
263
+ "Ios18.reshape" : 226,
264
+ "Ios18.constexprLutToDense" : 196,
265
+ "Ios18.conv" : 196,
266
+ "Ios18.concat" : 168,
267
+ "Ios18.transpose" : 254,
268
+ "Ios18.sub" : 112,
269
+ "Ios18.silu" : 28,
270
+ "Ios18.sliceByIndex" : 169,
271
+ "Ios18.squeeze" : 84
272
+ }
273
+ }
274
+ ],
275
+ "version" : "0.3.3",
276
+ "isUpdatable" : "0",
277
+ "defaultFunctionName" : "infer",
278
+ "specificationVersion" : 9,
279
+ "stateSchema" : [
280
+ {
281
+ "dataType" : "Float16",
282
+ "isOptional" : "0",
283
+ "formattedType" : "State (Float16 56 × 8 × 1024 × 128)",
284
+ "shortDescription" : "",
285
+ "shape" : "[56, 8, 1024, 128]",
286
+ "name" : "model_model_kv_cache_0",
287
+ "type" : "State"
288
+ }
289
+ ],
290
+ "computePrecision" : "Mixed (Float16, Int32)",
291
+ "mlProgramOperationTypeHistogram" : {
292
+ "Ios18.expandDims" : 112,
293
+ "Ios18.mul" : 224,
294
+ "Ios18.softmax" : 28,
295
+ "Ios18.matmul" : 56,
296
+ "Identity" : 1,
297
+ "Ios16.reduceMean" : 113,
298
+ "Ios18.greaterEqual" : 1,
299
+ "Select" : 1,
300
+ "Ios18.readState" : 57,
301
+ "Tile" : 56,
302
+ "Ios18.gather" : 2,
303
+ "Ios18.add" : 142,
304
+ "Ios18.layerNorm" : 113,
305
+ "Ios18.sliceUpdate" : 56,
306
+ "Ios18.writeState" : 56,
307
+ "Ios18.reshape" : 170,
308
+ "Ios18.constexprLutToDense" : 196,
309
+ "Ios18.conv" : 196,
310
+ "Ios18.concat" : 168,
311
+ "Ios18.transpose" : 168,
312
+ "Ios18.sub" : 113,
313
+ "Ios18.silu" : 28,
314
+ "Ios18.sliceByIndex" : 168,
315
+ "Ios18.squeeze" : 84
316
+ },
317
+ "shortDescription" : "Anemll Model: Multifunction FFN+Prefill",
318
+ "generatedClassName" : "qwen_FFN_PF_lut6_chunk_01of01",
319
+ "author" : "Converted with Anemll v0.3.3",
320
+ "modelType" : {
321
+ "name" : "MLModelType_mlProgram"
322
+ }
323
+ }
324
+ ]
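The metadata above describes a multifunction model: `prefill` consumes 64-token batches while `infer` advances one token at a time, and both share the same KV-cache state. A minimal loading sketch with coremltools, mirroring `load_model` and `create_unified_state` in the script above (path and compute unit are illustrative):

```python
import coremltools as ct

path = "qwen_FFN_PF_lut6_chunk_01of01.mlmodelc"
compute_unit = ct.ComputeUnit.CPU_AND_NE

# Compiled multifunction model: load each function separately
prefill = ct.models.CompiledMLModel(path, compute_unit, function_name="prefill")
infer = ct.models.CompiledMLModel(path, compute_unit, function_name="infer")

# One shared KV-cache state, passed to predict() for both functions
state = prefill.make_state()
```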
qwen_FFN_PF_lut6_chunk_01of01.mlmodelc/model.mil ADDED
The diff for this file is too large to render. See raw diff
 
qwen_FFN_PF_lut6_chunk_01of01.mlmodelc/weights/weight.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6834c66eaec14d9b5e2b6cda6678d21fad3b5e05a51a41d857a90bb796e5f2d4
3
+ size 1099974400
qwen_embeddings.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d3963e7a362fe7f47d9409d557163def6ca4170e9003c05a2dd08b73ecf203cc
3
+ size 1511
qwen_embeddings.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8d87ca38dd16d79cbda80d172efa2976900320241751eb91bc595bb3deb81307
3
+ size 320889024
qwen_embeddings.mlpackage/Manifest.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "fileFormatVersion": "1.0.0",
3
+ "itemInfoEntries": {
4
+ "126E495B-F8F8-4B6C-A040-C0265D3CF5B9": {
5
+ "author": "com.apple.CoreML",
6
+ "description": "CoreML Model Specification",
7
+ "name": "model.mlmodel",
8
+ "path": "com.apple.CoreML/model.mlmodel"
9
+ },
10
+ "C6FAF843-A470-4AAA-A384-F66C7369B8AA": {
11
+ "author": "com.apple.CoreML",
12
+ "description": "CoreML Model Weights",
13
+ "name": "weights",
14
+ "path": "com.apple.CoreML/weights"
15
+ }
16
+ },
17
+ "rootModelIdentifier": "126E495B-F8F8-4B6C-A040-C0265D3CF5B9"
18
+ }
qwen_lm_head_lut8.mlmodelc/analytics/coremldata.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d13e9e5789fc2376b6ecb3c3f561257847a963fa5c2d91f4e47ab35e24ae98be
3
+ size 243
qwen_lm_head_lut8.mlmodelc/coremldata.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:da309c2f9aeb2133dcf6105147d8aabf4d0652311ae5b0fdf98b9f97e3394c8e
3
+ size 898
qwen_lm_head_lut8.mlmodelc/metadata.json ADDED
@@ -0,0 +1,220 @@
1
+ [
2
+ {
3
+ "shortDescription" : "Anemll Model (LM Head) converted to CoreML",
4
+ "metadataOutputVersion" : "3.0",
5
+ "outputSchema" : [
6
+ {
7
+ "hasShapeFlexibility" : "0",
8
+ "isOptional" : "0",
9
+ "dataType" : "Float16",
10
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
11
+ "shortDescription" : "",
12
+ "shape" : "[1, 1, 9496]",
13
+ "name" : "logits1",
14
+ "type" : "MultiArray"
15
+ },
16
+ {
17
+ "hasShapeFlexibility" : "0",
18
+ "isOptional" : "0",
19
+ "dataType" : "Float16",
20
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
21
+ "shortDescription" : "",
22
+ "shape" : "[1, 1, 9496]",
23
+ "name" : "logits2",
24
+ "type" : "MultiArray"
25
+ },
26
+ {
27
+ "hasShapeFlexibility" : "0",
28
+ "isOptional" : "0",
29
+ "dataType" : "Float16",
30
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
31
+ "shortDescription" : "",
32
+ "shape" : "[1, 1, 9496]",
33
+ "name" : "logits3",
34
+ "type" : "MultiArray"
35
+ },
36
+ {
37
+ "hasShapeFlexibility" : "0",
38
+ "isOptional" : "0",
39
+ "dataType" : "Float16",
40
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
41
+ "shortDescription" : "",
42
+ "shape" : "[1, 1, 9496]",
43
+ "name" : "logits4",
44
+ "type" : "MultiArray"
45
+ },
46
+ {
47
+ "hasShapeFlexibility" : "0",
48
+ "isOptional" : "0",
49
+ "dataType" : "Float16",
50
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
51
+ "shortDescription" : "",
52
+ "shape" : "[1, 1, 9496]",
53
+ "name" : "logits5",
54
+ "type" : "MultiArray"
55
+ },
56
+ {
57
+ "hasShapeFlexibility" : "0",
58
+ "isOptional" : "0",
59
+ "dataType" : "Float16",
60
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
61
+ "shortDescription" : "",
62
+ "shape" : "[1, 1, 9496]",
63
+ "name" : "logits6",
64
+ "type" : "MultiArray"
65
+ },
66
+ {
67
+ "hasShapeFlexibility" : "0",
68
+ "isOptional" : "0",
69
+ "dataType" : "Float16",
70
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
71
+ "shortDescription" : "",
72
+ "shape" : "[1, 1, 9496]",
73
+ "name" : "logits7",
74
+ "type" : "MultiArray"
75
+ },
76
+ {
77
+ "hasShapeFlexibility" : "0",
78
+ "isOptional" : "0",
79
+ "dataType" : "Float16",
80
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
81
+ "shortDescription" : "",
82
+ "shape" : "[1, 1, 9496]",
83
+ "name" : "logits8",
84
+ "type" : "MultiArray"
85
+ },
86
+ {
87
+ "hasShapeFlexibility" : "0",
88
+ "isOptional" : "0",
89
+ "dataType" : "Float16",
90
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
91
+ "shortDescription" : "",
92
+ "shape" : "[1, 1, 9496]",
93
+ "name" : "logits9",
94
+ "type" : "MultiArray"
95
+ },
96
+ {
97
+ "hasShapeFlexibility" : "0",
98
+ "isOptional" : "0",
99
+ "dataType" : "Float16",
100
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
101
+ "shortDescription" : "",
102
+ "shape" : "[1, 1, 9496]",
103
+ "name" : "logits10",
104
+ "type" : "MultiArray"
105
+ },
106
+ {
107
+ "hasShapeFlexibility" : "0",
108
+ "isOptional" : "0",
109
+ "dataType" : "Float16",
110
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
111
+ "shortDescription" : "",
112
+ "shape" : "[1, 1, 9496]",
113
+ "name" : "logits11",
114
+ "type" : "MultiArray"
115
+ },
116
+ {
117
+ "hasShapeFlexibility" : "0",
118
+ "isOptional" : "0",
119
+ "dataType" : "Float16",
120
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
121
+ "shortDescription" : "",
122
+ "shape" : "[1, 1, 9496]",
123
+ "name" : "logits12",
124
+ "type" : "MultiArray"
125
+ },
126
+ {
127
+ "hasShapeFlexibility" : "0",
128
+ "isOptional" : "0",
129
+ "dataType" : "Float16",
130
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
131
+ "shortDescription" : "",
132
+ "shape" : "[1, 1, 9496]",
133
+ "name" : "logits13",
134
+ "type" : "MultiArray"
135
+ },
136
+ {
137
+ "hasShapeFlexibility" : "0",
138
+ "isOptional" : "0",
139
+ "dataType" : "Float16",
140
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
141
+ "shortDescription" : "",
142
+ "shape" : "[1, 1, 9496]",
143
+ "name" : "logits14",
144
+ "type" : "MultiArray"
145
+ },
146
+ {
147
+ "hasShapeFlexibility" : "0",
148
+ "isOptional" : "0",
149
+ "dataType" : "Float16",
150
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
151
+ "shortDescription" : "",
152
+ "shape" : "[1, 1, 9496]",
153
+ "name" : "logits15",
154
+ "type" : "MultiArray"
155
+ },
156
+ {
157
+ "hasShapeFlexibility" : "0",
158
+ "isOptional" : "0",
159
+ "dataType" : "Float16",
160
+ "formattedType" : "MultiArray (Float16 1 × 1 × 9496)",
161
+ "shortDescription" : "",
162
+ "shape" : "[1, 1, 9496]",
163
+ "name" : "logits16",
164
+ "type" : "MultiArray"
165
+ }
166
+ ],
167
+ "version" : "0.3.3",
168
+ "modelParameters" : [
169
+
170
+ ],
171
+ "author" : "Converted with Anemll v0.3.3",
172
+ "specificationVersion" : 9,
173
+ "storagePrecision" : "Mixed (Float16, Palettized (19 bits), UInt8)",
174
+ "mlProgramOperationTypeHistogram" : {
175
+ "Ios18.transpose" : 17,
176
+ "Ios18.constexprLutToDense" : 16,
177
+ "Ios18.expandDims" : 1,
178
+ "Ios18.conv" : 16,
179
+ "Ios18.squeeze" : 16
180
+ },
181
+ "computePrecision" : "Mixed (Float16, Int32)",
182
+ "stateSchema" : [
183
+
184
+ ],
185
+ "isUpdatable" : "0",
186
+ "availability" : {
187
+ "macOS" : "15.0",
188
+ "tvOS" : "18.0",
189
+ "visionOS" : "2.0",
190
+ "watchOS" : "11.0",
191
+ "iOS" : "18.0",
192
+ "macCatalyst" : "18.0"
193
+ },
194
+ "modelType" : {
195
+ "name" : "MLModelType_mlProgram"
196
+ },
197
+ "inputSchema" : [
198
+ {
199
+ "hasShapeFlexibility" : "0",
200
+ "isOptional" : "0",
201
+ "dataType" : "Float16",
202
+ "formattedType" : "MultiArray (Float16 1 × 1 × 2048)",
203
+ "shortDescription" : "",
204
+ "shape" : "[1, 1, 2048]",
205
+ "name" : "hidden_states",
206
+ "type" : "MultiArray"
207
+ }
208
+ ],
209
+ "userDefinedMetadata" : {
210
+ "com.anemll.info" : "Converted with Anemll v0.3.3",
211
+ "com.github.apple.coremltools.source_dialect" : "TorchScript",
212
+ "com.anemll.lut_bits" : "8",
213
+ "com.github.apple.coremltools.source" : "torch==2.5.0",
214
+ "com.github.apple.coremltools.version" : "8.3.0",
215
+ "com.anemll.context_length" : "1024"
216
+ },
217
+ "generatedClassName" : "qwen_lm_head_lut8",
218
+ "method" : "predict"
219
+ }
220
+ ]
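Note on the output schema above: the LM head emits 16 logits slices of 9,496 values each, matching `split_lm_head: 16` in meta.yaml; `generate_next_token` in the script concatenates them back into a single vector of 16 × 9496 = 151,936 logits (the full vocabulary) before picking the next token.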
qwen_lm_head_lut8.mlmodelc/model.mil ADDED
@@ -0,0 +1,186 @@
1
+ program(1.3)
2
+ [buildInfo = dict<string, string>({{"coremlc-component-MIL", "3500.11.1"}, {"coremlc-version", "3500.21.1"}})]
3
+ {
4
+ func main<ios18>(tensor<fp16, [1, 1, 2048]> hidden_states) {
5
+ tensor<int32, [3]> var_5 = const()[name = string("op_5"), val = tensor<int32, [3]>([0, 2, 1])];
6
+ tensor<int32, [1]> input_axes_0 = const()[name = string("input_axes_0"), val = tensor<int32, [1]>([2])];
7
+ tensor<fp16, [1, 2048, 1]> var_6_cast_fp16 = transpose(perm = var_5, x = hidden_states)[name = string("transpose_16")];
8
+ tensor<fp16, [1, 2048, 1, 1]> input_cast_fp16 = expand_dims(axes = input_axes_0, x = var_6_cast_fp16)[name = string("input_cast_fp16")];
9
+ string var_29_pad_type_0 = const()[name = string("op_29_pad_type_0"), val = string("valid")];
10
+ tensor<int32, [2]> var_29_strides_0 = const()[name = string("op_29_strides_0"), val = tensor<int32, [2]>([1, 1])];
11
+ tensor<int32, [4]> var_29_pad_0 = const()[name = string("op_29_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
12
+ tensor<int32, [2]> var_29_dilations_0 = const()[name = string("op_29_dilations_0"), val = tensor<int32, [2]>([1, 1])];
13
+ int32 var_29_groups_0 = const()[name = string("op_29_groups_0"), val = int32(1)];
14
+ tensor<fp16, [9496, 2048, 1, 1]> op_9_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(64))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(19447936))))[name = string("op_9_promoted_to_fp16_palettized")];
15
+ tensor<fp16, [1, 9496, 1, 1]> var_29_cast_fp16 = conv(dilations = var_29_dilations_0, groups = var_29_groups_0, pad = var_29_pad_0, pad_type = var_29_pad_type_0, strides = var_29_strides_0, weight = op_9_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_29_cast_fp16")];
16
+ tensor<int32, [1]> var_31_axes_0 = const()[name = string("op_31_axes_0"), val = tensor<int32, [1]>([2])];
17
+ tensor<fp16, [1, 9496, 1]> var_31_cast_fp16 = squeeze(axes = var_31_axes_0, x = var_29_cast_fp16)[name = string("op_31_cast_fp16")];
18
+ tensor<int32, [3]> var_34_perm_0 = const()[name = string("op_34_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
19
+ string var_55_pad_type_0 = const()[name = string("op_55_pad_type_0"), val = string("valid")];
20
+ tensor<int32, [2]> var_55_strides_0 = const()[name = string("op_55_strides_0"), val = tensor<int32, [2]>([1, 1])];
21
+ tensor<int32, [4]> var_55_pad_0 = const()[name = string("op_55_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
22
+ tensor<int32, [2]> var_55_dilations_0 = const()[name = string("op_55_dilations_0"), val = tensor<int32, [2]>([1, 1])];
23
+ int32 var_55_groups_0 = const()[name = string("op_55_groups_0"), val = int32(1)];
24
+ tensor<fp16, [9496, 2048, 1, 1]> op_35_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(20055744))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(39503616))))[name = string("op_35_promoted_to_fp16_palettized")];
25
+ tensor<fp16, [1, 9496, 1, 1]> var_55_cast_fp16 = conv(dilations = var_55_dilations_0, groups = var_55_groups_0, pad = var_55_pad_0, pad_type = var_55_pad_type_0, strides = var_55_strides_0, weight = op_35_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_55_cast_fp16")];
26
+ tensor<int32, [1]> var_57_axes_0 = const()[name = string("op_57_axes_0"), val = tensor<int32, [1]>([2])];
27
+ tensor<fp16, [1, 9496, 1]> var_57_cast_fp16 = squeeze(axes = var_57_axes_0, x = var_55_cast_fp16)[name = string("op_57_cast_fp16")];
28
+ tensor<int32, [3]> var_60_perm_0 = const()[name = string("op_60_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
29
+ string var_81_pad_type_0 = const()[name = string("op_81_pad_type_0"), val = string("valid")];
30
+ tensor<int32, [2]> var_81_strides_0 = const()[name = string("op_81_strides_0"), val = tensor<int32, [2]>([1, 1])];
31
+ tensor<int32, [4]> var_81_pad_0 = const()[name = string("op_81_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
32
+ tensor<int32, [2]> var_81_dilations_0 = const()[name = string("op_81_dilations_0"), val = tensor<int32, [2]>([1, 1])];
33
+ int32 var_81_groups_0 = const()[name = string("op_81_groups_0"), val = int32(1)];
34
+ tensor<fp16, [9496, 2048, 1, 1]> op_61_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(40111424))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(59559296))))[name = string("op_61_promoted_to_fp16_palettized")];
35
+ tensor<fp16, [1, 9496, 1, 1]> var_81_cast_fp16 = conv(dilations = var_81_dilations_0, groups = var_81_groups_0, pad = var_81_pad_0, pad_type = var_81_pad_type_0, strides = var_81_strides_0, weight = op_61_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_81_cast_fp16")];
36
+ tensor<int32, [1]> var_83_axes_0 = const()[name = string("op_83_axes_0"), val = tensor<int32, [1]>([2])];
37
+ tensor<fp16, [1, 9496, 1]> var_83_cast_fp16 = squeeze(axes = var_83_axes_0, x = var_81_cast_fp16)[name = string("op_83_cast_fp16")];
38
+ tensor<int32, [3]> var_86_perm_0 = const()[name = string("op_86_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
39
+ string var_107_pad_type_0 = const()[name = string("op_107_pad_type_0"), val = string("valid")];
40
+ tensor<int32, [2]> var_107_strides_0 = const()[name = string("op_107_strides_0"), val = tensor<int32, [2]>([1, 1])];
41
+ tensor<int32, [4]> var_107_pad_0 = const()[name = string("op_107_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
42
+ tensor<int32, [2]> var_107_dilations_0 = const()[name = string("op_107_dilations_0"), val = tensor<int32, [2]>([1, 1])];
43
+ int32 var_107_groups_0 = const()[name = string("op_107_groups_0"), val = int32(1)];
44
+ tensor<fp16, [9496, 2048, 1, 1]> op_87_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(60167104))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(79614976))))[name = string("op_87_promoted_to_fp16_palettized")];
45
+ tensor<fp16, [1, 9496, 1, 1]> var_107_cast_fp16 = conv(dilations = var_107_dilations_0, groups = var_107_groups_0, pad = var_107_pad_0, pad_type = var_107_pad_type_0, strides = var_107_strides_0, weight = op_87_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_107_cast_fp16")];
46
+ tensor<int32, [1]> var_109_axes_0 = const()[name = string("op_109_axes_0"), val = tensor<int32, [1]>([2])];
47
+ tensor<fp16, [1, 9496, 1]> var_109_cast_fp16 = squeeze(axes = var_109_axes_0, x = var_107_cast_fp16)[name = string("op_109_cast_fp16")];
48
+ tensor<int32, [3]> var_112_perm_0 = const()[name = string("op_112_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
49
+ string var_133_pad_type_0 = const()[name = string("op_133_pad_type_0"), val = string("valid")];
50
+ tensor<int32, [2]> var_133_strides_0 = const()[name = string("op_133_strides_0"), val = tensor<int32, [2]>([1, 1])];
51
+ tensor<int32, [4]> var_133_pad_0 = const()[name = string("op_133_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
52
+ tensor<int32, [2]> var_133_dilations_0 = const()[name = string("op_133_dilations_0"), val = tensor<int32, [2]>([1, 1])];
53
+ int32 var_133_groups_0 = const()[name = string("op_133_groups_0"), val = int32(1)];
54
+ tensor<fp16, [9496, 2048, 1, 1]> op_113_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(80222784))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(99670656))))[name = string("op_113_promoted_to_fp16_palettized")];
55
+ tensor<fp16, [1, 9496, 1, 1]> var_133_cast_fp16 = conv(dilations = var_133_dilations_0, groups = var_133_groups_0, pad = var_133_pad_0, pad_type = var_133_pad_type_0, strides = var_133_strides_0, weight = op_113_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_133_cast_fp16")];
56
+ tensor<int32, [1]> var_135_axes_0 = const()[name = string("op_135_axes_0"), val = tensor<int32, [1]>([2])];
57
+ tensor<fp16, [1, 9496, 1]> var_135_cast_fp16 = squeeze(axes = var_135_axes_0, x = var_133_cast_fp16)[name = string("op_135_cast_fp16")];
58
+ tensor<int32, [3]> var_138_perm_0 = const()[name = string("op_138_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
59
+ string var_159_pad_type_0 = const()[name = string("op_159_pad_type_0"), val = string("valid")];
60
+ tensor<int32, [2]> var_159_strides_0 = const()[name = string("op_159_strides_0"), val = tensor<int32, [2]>([1, 1])];
61
+ tensor<int32, [4]> var_159_pad_0 = const()[name = string("op_159_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
62
+ tensor<int32, [2]> var_159_dilations_0 = const()[name = string("op_159_dilations_0"), val = tensor<int32, [2]>([1, 1])];
63
+ int32 var_159_groups_0 = const()[name = string("op_159_groups_0"), val = int32(1)];
64
+ tensor<fp16, [9496, 2048, 1, 1]> op_139_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(100278464))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(119726336))))[name = string("op_139_promoted_to_fp16_palettized")];
65
+ tensor<fp16, [1, 9496, 1, 1]> var_159_cast_fp16 = conv(dilations = var_159_dilations_0, groups = var_159_groups_0, pad = var_159_pad_0, pad_type = var_159_pad_type_0, strides = var_159_strides_0, weight = op_139_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_159_cast_fp16")];
66
+ tensor<int32, [1]> var_161_axes_0 = const()[name = string("op_161_axes_0"), val = tensor<int32, [1]>([2])];
67
+ tensor<fp16, [1, 9496, 1]> var_161_cast_fp16 = squeeze(axes = var_161_axes_0, x = var_159_cast_fp16)[name = string("op_161_cast_fp16")];
68
+ tensor<int32, [3]> var_164_perm_0 = const()[name = string("op_164_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
69
+ string var_185_pad_type_0 = const()[name = string("op_185_pad_type_0"), val = string("valid")];
70
+ tensor<int32, [2]> var_185_strides_0 = const()[name = string("op_185_strides_0"), val = tensor<int32, [2]>([1, 1])];
71
+ tensor<int32, [4]> var_185_pad_0 = const()[name = string("op_185_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
72
+ tensor<int32, [2]> var_185_dilations_0 = const()[name = string("op_185_dilations_0"), val = tensor<int32, [2]>([1, 1])];
73
+ int32 var_185_groups_0 = const()[name = string("op_185_groups_0"), val = int32(1)];
74
+ tensor<fp16, [9496, 2048, 1, 1]> op_165_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(120334144))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(139782016))))[name = string("op_165_promoted_to_fp16_palettized")];
75
+ tensor<fp16, [1, 9496, 1, 1]> var_185_cast_fp16 = conv(dilations = var_185_dilations_0, groups = var_185_groups_0, pad = var_185_pad_0, pad_type = var_185_pad_type_0, strides = var_185_strides_0, weight = op_165_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_185_cast_fp16")];
76
+ tensor<int32, [1]> var_187_axes_0 = const()[name = string("op_187_axes_0"), val = tensor<int32, [1]>([2])];
77
+ tensor<fp16, [1, 9496, 1]> var_187_cast_fp16 = squeeze(axes = var_187_axes_0, x = var_185_cast_fp16)[name = string("op_187_cast_fp16")];
78
+ tensor<int32, [3]> var_190_perm_0 = const()[name = string("op_190_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
79
+ string var_211_pad_type_0 = const()[name = string("op_211_pad_type_0"), val = string("valid")];
80
+ tensor<int32, [2]> var_211_strides_0 = const()[name = string("op_211_strides_0"), val = tensor<int32, [2]>([1, 1])];
81
+ tensor<int32, [4]> var_211_pad_0 = const()[name = string("op_211_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
82
+ tensor<int32, [2]> var_211_dilations_0 = const()[name = string("op_211_dilations_0"), val = tensor<int32, [2]>([1, 1])];
83
+ int32 var_211_groups_0 = const()[name = string("op_211_groups_0"), val = int32(1)];
84
+ tensor<fp16, [9496, 2048, 1, 1]> op_191_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(140389824))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(159837696))))[name = string("op_191_promoted_to_fp16_palettized")];
85
+ tensor<fp16, [1, 9496, 1, 1]> var_211_cast_fp16 = conv(dilations = var_211_dilations_0, groups = var_211_groups_0, pad = var_211_pad_0, pad_type = var_211_pad_type_0, strides = var_211_strides_0, weight = op_191_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_211_cast_fp16")];
86
+ tensor<int32, [1]> var_213_axes_0 = const()[name = string("op_213_axes_0"), val = tensor<int32, [1]>([2])];
87
+ tensor<fp16, [1, 9496, 1]> var_213_cast_fp16 = squeeze(axes = var_213_axes_0, x = var_211_cast_fp16)[name = string("op_213_cast_fp16")];
88
+ tensor<int32, [3]> var_216_perm_0 = const()[name = string("op_216_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
89
+ string var_237_pad_type_0 = const()[name = string("op_237_pad_type_0"), val = string("valid")];
90
+ tensor<int32, [2]> var_237_strides_0 = const()[name = string("op_237_strides_0"), val = tensor<int32, [2]>([1, 1])];
91
+ tensor<int32, [4]> var_237_pad_0 = const()[name = string("op_237_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
92
+ tensor<int32, [2]> var_237_dilations_0 = const()[name = string("op_237_dilations_0"), val = tensor<int32, [2]>([1, 1])];
93
+ int32 var_237_groups_0 = const()[name = string("op_237_groups_0"), val = int32(1)];
94
+ tensor<fp16, [9496, 2048, 1, 1]> op_217_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(160445504))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(179893376))))[name = string("op_217_promoted_to_fp16_palettized")];
95
+ tensor<fp16, [1, 9496, 1, 1]> var_237_cast_fp16 = conv(dilations = var_237_dilations_0, groups = var_237_groups_0, pad = var_237_pad_0, pad_type = var_237_pad_type_0, strides = var_237_strides_0, weight = op_217_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_237_cast_fp16")];
96
+ tensor<int32, [1]> var_239_axes_0 = const()[name = string("op_239_axes_0"), val = tensor<int32, [1]>([2])];
97
+ tensor<fp16, [1, 9496, 1]> var_239_cast_fp16 = squeeze(axes = var_239_axes_0, x = var_237_cast_fp16)[name = string("op_239_cast_fp16")];
98
+ tensor<int32, [3]> var_242_perm_0 = const()[name = string("op_242_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
99
+ string var_263_pad_type_0 = const()[name = string("op_263_pad_type_0"), val = string("valid")];
100
+ tensor<int32, [2]> var_263_strides_0 = const()[name = string("op_263_strides_0"), val = tensor<int32, [2]>([1, 1])];
101
+ tensor<int32, [4]> var_263_pad_0 = const()[name = string("op_263_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
102
+ tensor<int32, [2]> var_263_dilations_0 = const()[name = string("op_263_dilations_0"), val = tensor<int32, [2]>([1, 1])];
103
+ int32 var_263_groups_0 = const()[name = string("op_263_groups_0"), val = int32(1)];
104
+ tensor<fp16, [9496, 2048, 1, 1]> op_243_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(180501184))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(199949056))))[name = string("op_243_promoted_to_fp16_palettized")];
105
+ tensor<fp16, [1, 9496, 1, 1]> var_263_cast_fp16 = conv(dilations = var_263_dilations_0, groups = var_263_groups_0, pad = var_263_pad_0, pad_type = var_263_pad_type_0, strides = var_263_strides_0, weight = op_243_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_263_cast_fp16")];
106
+ tensor<int32, [1]> var_265_axes_0 = const()[name = string("op_265_axes_0"), val = tensor<int32, [1]>([2])];
107
+ tensor<fp16, [1, 9496, 1]> var_265_cast_fp16 = squeeze(axes = var_265_axes_0, x = var_263_cast_fp16)[name = string("op_265_cast_fp16")];
108
+ tensor<int32, [3]> var_268_perm_0 = const()[name = string("op_268_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
109
+ string var_289_pad_type_0 = const()[name = string("op_289_pad_type_0"), val = string("valid")];
110
+ tensor<int32, [2]> var_289_strides_0 = const()[name = string("op_289_strides_0"), val = tensor<int32, [2]>([1, 1])];
111
+ tensor<int32, [4]> var_289_pad_0 = const()[name = string("op_289_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
112
+ tensor<int32, [2]> var_289_dilations_0 = const()[name = string("op_289_dilations_0"), val = tensor<int32, [2]>([1, 1])];
113
+ int32 var_289_groups_0 = const()[name = string("op_289_groups_0"), val = int32(1)];
114
+ tensor<fp16, [9496, 2048, 1, 1]> op_269_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(200556864))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(220004736))))[name = string("op_269_promoted_to_fp16_palettized")];
115
+ tensor<fp16, [1, 9496, 1, 1]> var_289_cast_fp16 = conv(dilations = var_289_dilations_0, groups = var_289_groups_0, pad = var_289_pad_0, pad_type = var_289_pad_type_0, strides = var_289_strides_0, weight = op_269_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_289_cast_fp16")];
116
+ tensor<int32, [1]> var_291_axes_0 = const()[name = string("op_291_axes_0"), val = tensor<int32, [1]>([2])];
117
+ tensor<fp16, [1, 9496, 1]> var_291_cast_fp16 = squeeze(axes = var_291_axes_0, x = var_289_cast_fp16)[name = string("op_291_cast_fp16")];
118
+ tensor<int32, [3]> var_294_perm_0 = const()[name = string("op_294_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
119
+ string var_315_pad_type_0 = const()[name = string("op_315_pad_type_0"), val = string("valid")];
120
+ tensor<int32, [2]> var_315_strides_0 = const()[name = string("op_315_strides_0"), val = tensor<int32, [2]>([1, 1])];
121
+ tensor<int32, [4]> var_315_pad_0 = const()[name = string("op_315_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
122
+ tensor<int32, [2]> var_315_dilations_0 = const()[name = string("op_315_dilations_0"), val = tensor<int32, [2]>([1, 1])];
123
+ int32 var_315_groups_0 = const()[name = string("op_315_groups_0"), val = int32(1)];
124
+ tensor<fp16, [9496, 2048, 1, 1]> op_295_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(220612544))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(240060416))))[name = string("op_295_promoted_to_fp16_palettized")];
125
+ tensor<fp16, [1, 9496, 1, 1]> var_315_cast_fp16 = conv(dilations = var_315_dilations_0, groups = var_315_groups_0, pad = var_315_pad_0, pad_type = var_315_pad_type_0, strides = var_315_strides_0, weight = op_295_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_315_cast_fp16")];
126
+ tensor<int32, [1]> var_317_axes_0 = const()[name = string("op_317_axes_0"), val = tensor<int32, [1]>([2])];
127
+ tensor<fp16, [1, 9496, 1]> var_317_cast_fp16 = squeeze(axes = var_317_axes_0, x = var_315_cast_fp16)[name = string("op_317_cast_fp16")];
128
+ tensor<int32, [3]> var_320_perm_0 = const()[name = string("op_320_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
129
+ string var_341_pad_type_0 = const()[name = string("op_341_pad_type_0"), val = string("valid")];
130
+ tensor<int32, [2]> var_341_strides_0 = const()[name = string("op_341_strides_0"), val = tensor<int32, [2]>([1, 1])];
131
+ tensor<int32, [4]> var_341_pad_0 = const()[name = string("op_341_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
132
+ tensor<int32, [2]> var_341_dilations_0 = const()[name = string("op_341_dilations_0"), val = tensor<int32, [2]>([1, 1])];
133
+ int32 var_341_groups_0 = const()[name = string("op_341_groups_0"), val = int32(1)];
134
+ tensor<fp16, [9496, 2048, 1, 1]> op_321_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(240668224))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(260116096))))[name = string("op_321_promoted_to_fp16_palettized")];
135
+ tensor<fp16, [1, 9496, 1, 1]> var_341_cast_fp16 = conv(dilations = var_341_dilations_0, groups = var_341_groups_0, pad = var_341_pad_0, pad_type = var_341_pad_type_0, strides = var_341_strides_0, weight = op_321_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_341_cast_fp16")];
136
+ tensor<int32, [1]> var_343_axes_0 = const()[name = string("op_343_axes_0"), val = tensor<int32, [1]>([2])];
137
+ tensor<fp16, [1, 9496, 1]> var_343_cast_fp16 = squeeze(axes = var_343_axes_0, x = var_341_cast_fp16)[name = string("op_343_cast_fp16")];
138
+ tensor<int32, [3]> var_346_perm_0 = const()[name = string("op_346_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
139
+ string var_367_pad_type_0 = const()[name = string("op_367_pad_type_0"), val = string("valid")];
140
+ tensor<int32, [2]> var_367_strides_0 = const()[name = string("op_367_strides_0"), val = tensor<int32, [2]>([1, 1])];
141
+ tensor<int32, [4]> var_367_pad_0 = const()[name = string("op_367_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
142
+ tensor<int32, [2]> var_367_dilations_0 = const()[name = string("op_367_dilations_0"), val = tensor<int32, [2]>([1, 1])];
143
+ int32 var_367_groups_0 = const()[name = string("op_367_groups_0"), val = int32(1)];
144
+ tensor<fp16, [9496, 2048, 1, 1]> op_347_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(260723904))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(280171776))))[name = string("op_347_promoted_to_fp16_palettized")];
145
+ tensor<fp16, [1, 9496, 1, 1]> var_367_cast_fp16 = conv(dilations = var_367_dilations_0, groups = var_367_groups_0, pad = var_367_pad_0, pad_type = var_367_pad_type_0, strides = var_367_strides_0, weight = op_347_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_367_cast_fp16")];
146
+ tensor<int32, [1]> var_369_axes_0 = const()[name = string("op_369_axes_0"), val = tensor<int32, [1]>([2])];
147
+ tensor<fp16, [1, 9496, 1]> var_369_cast_fp16 = squeeze(axes = var_369_axes_0, x = var_367_cast_fp16)[name = string("op_369_cast_fp16")];
148
+ tensor<int32, [3]> var_372_perm_0 = const()[name = string("op_372_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
149
+ string var_393_pad_type_0 = const()[name = string("op_393_pad_type_0"), val = string("valid")];
150
+ tensor<int32, [2]> var_393_strides_0 = const()[name = string("op_393_strides_0"), val = tensor<int32, [2]>([1, 1])];
151
+ tensor<int32, [4]> var_393_pad_0 = const()[name = string("op_393_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
152
+ tensor<int32, [2]> var_393_dilations_0 = const()[name = string("op_393_dilations_0"), val = tensor<int32, [2]>([1, 1])];
153
+ int32 var_393_groups_0 = const()[name = string("op_393_groups_0"), val = int32(1)];
154
+ tensor<fp16, [9496, 2048, 1, 1]> op_373_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(280779584))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(300227456))))[name = string("op_373_promoted_to_fp16_palettized")];
155
+ tensor<fp16, [1, 9496, 1, 1]> var_393_cast_fp16 = conv(dilations = var_393_dilations_0, groups = var_393_groups_0, pad = var_393_pad_0, pad_type = var_393_pad_type_0, strides = var_393_strides_0, weight = op_373_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_393_cast_fp16")];
156
+ tensor<int32, [1]> var_395_axes_0 = const()[name = string("op_395_axes_0"), val = tensor<int32, [1]>([2])];
157
+ tensor<fp16, [1, 9496, 1]> var_395_cast_fp16 = squeeze(axes = var_395_axes_0, x = var_393_cast_fp16)[name = string("op_395_cast_fp16")];
158
+ tensor<int32, [3]> var_398_perm_0 = const()[name = string("op_398_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
159
+ string var_419_pad_type_0 = const()[name = string("op_419_pad_type_0"), val = string("valid")];
160
+ tensor<int32, [2]> var_419_strides_0 = const()[name = string("op_419_strides_0"), val = tensor<int32, [2]>([1, 1])];
161
+ tensor<int32, [4]> var_419_pad_0 = const()[name = string("op_419_pad_0"), val = tensor<int32, [4]>([0, 0, 0, 0])];
162
+ tensor<int32, [2]> var_419_dilations_0 = const()[name = string("op_419_dilations_0"), val = tensor<int32, [2]>([1, 1])];
163
+ int32 var_419_groups_0 = const()[name = string("op_419_groups_0"), val = int32(1)];
164
+ tensor<fp16, [9496, 2048, 1, 1]> op_399_promoted_to_fp16_palettized = constexpr_lut_to_dense(indices = tensor<uint8, [9496, 2048, 1, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(300835264))), lut = tensor<fp16, [1187, 1, 1, 1, 256, 1]>(BLOBFILE(path = string("@model_path/weights/weight.bin"), offset = uint64(320283136))))[name = string("op_399_promoted_to_fp16_palettized")];
165
+ tensor<fp16, [1, 9496, 1, 1]> var_419_cast_fp16 = conv(dilations = var_419_dilations_0, groups = var_419_groups_0, pad = var_419_pad_0, pad_type = var_419_pad_type_0, strides = var_419_strides_0, weight = op_399_promoted_to_fp16_palettized, x = input_cast_fp16)[name = string("op_419_cast_fp16")];
166
+ tensor<int32, [1]> var_421_axes_0 = const()[name = string("op_421_axes_0"), val = tensor<int32, [1]>([2])];
167
+ tensor<fp16, [1, 9496, 1]> var_421_cast_fp16 = squeeze(axes = var_421_axes_0, x = var_419_cast_fp16)[name = string("op_421_cast_fp16")];
168
+ tensor<int32, [3]> var_424_perm_0 = const()[name = string("op_424_perm_0"), val = tensor<int32, [3]>([0, 2, 1])];
169
+ tensor<fp16, [1, 1, 9496]> logits1 = transpose(perm = var_34_perm_0, x = var_31_cast_fp16)[name = string("transpose_0")];
170
+ tensor<fp16, [1, 1, 9496]> logits2 = transpose(perm = var_60_perm_0, x = var_57_cast_fp16)[name = string("transpose_1")];
171
+ tensor<fp16, [1, 1, 9496]> logits3 = transpose(perm = var_86_perm_0, x = var_83_cast_fp16)[name = string("transpose_2")];
172
+ tensor<fp16, [1, 1, 9496]> logits4 = transpose(perm = var_112_perm_0, x = var_109_cast_fp16)[name = string("transpose_3")];
173
+ tensor<fp16, [1, 1, 9496]> logits5 = transpose(perm = var_138_perm_0, x = var_135_cast_fp16)[name = string("transpose_4")];
174
+ tensor<fp16, [1, 1, 9496]> logits6 = transpose(perm = var_164_perm_0, x = var_161_cast_fp16)[name = string("transpose_5")];
175
+ tensor<fp16, [1, 1, 9496]> logits7 = transpose(perm = var_190_perm_0, x = var_187_cast_fp16)[name = string("transpose_6")];
176
+ tensor<fp16, [1, 1, 9496]> logits8 = transpose(perm = var_216_perm_0, x = var_213_cast_fp16)[name = string("transpose_7")];
177
+ tensor<fp16, [1, 1, 9496]> logits9 = transpose(perm = var_242_perm_0, x = var_239_cast_fp16)[name = string("transpose_8")];
178
+ tensor<fp16, [1, 1, 9496]> logits10 = transpose(perm = var_268_perm_0, x = var_265_cast_fp16)[name = string("transpose_9")];
179
+ tensor<fp16, [1, 1, 9496]> logits11 = transpose(perm = var_294_perm_0, x = var_291_cast_fp16)[name = string("transpose_10")];
180
+ tensor<fp16, [1, 1, 9496]> logits12 = transpose(perm = var_320_perm_0, x = var_317_cast_fp16)[name = string("transpose_11")];
181
+ tensor<fp16, [1, 1, 9496]> logits13 = transpose(perm = var_346_perm_0, x = var_343_cast_fp16)[name = string("transpose_12")];
182
+ tensor<fp16, [1, 1, 9496]> logits14 = transpose(perm = var_372_perm_0, x = var_369_cast_fp16)[name = string("transpose_13")];
183
+ tensor<fp16, [1, 1, 9496]> logits15 = transpose(perm = var_398_perm_0, x = var_395_cast_fp16)[name = string("transpose_14")];
184
+ tensor<fp16, [1, 1, 9496]> logits16 = transpose(perm = var_424_perm_0, x = var_421_cast_fp16)[name = string("transpose_15")];
185
+ } -> (logits1, logits2, logits3, logits4, logits5, logits6, logits7, logits8, logits9, logits10, logits11, logits12, logits13, logits14, logits15, logits16);
186
+ }
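
For reference, the LM head above splits the vocabulary projection into 16 parallel conv branches, each producing a `[1, 1, 9496]` fp16 chunk (`logits1` … `logits16`); 16 × 9496 = 151936, the full Qwen vocabulary. Below is a minimal sketch of how a caller might stitch those chunks back together after running the model. The `outputs` dictionary and the decoding line are illustrative assumptions, not code taken from the shipped `chat.py`:

```python
import numpy as np

# Hypothetical example: `outputs` stands for the prediction dictionary returned
# when running the qwen_lm_head model (e.g. via coremltools' predict call).
def merge_logits(outputs):
    # The LM head emits 16 chunks, logits1..logits16, each shaped [1, 1, 9496].
    chunks = [np.asarray(outputs[f"logits{i}"]) for i in range(1, 17)]
    full = np.concatenate(chunks, axis=-1)   # -> [1, 1, 151936]
    assert full.shape[-1] == 16 * 9496       # 151936 = Qwen vocabulary size
    return full

# Greedy selection of the next token from the merged logits (illustrative):
# next_token = int(np.argmax(merge_logits(outputs)[0, -1]))
```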
qwen_lm_head_lut8.mlmodelc/weights/weight.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:86193d85980d0300e663a2b7e81690c60677accf69db5ccbaf072671a7a55ff5
3
+ size 320890944
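
The three lines above are a Git LFS pointer; the actual `weight.bin` (~320 MB) is only present after `git lfs pull`. If a download looks suspect, the blob can be checked against the pointer's oid and size. A minimal sketch, assuming the repository was cloned to the current directory (the path is an assumption):

```python
import hashlib
from pathlib import Path

# Assumed location inside the cloned repo; adjust as needed.
path = Path("qwen_lm_head_lut8.mlmodelc/weights/weight.bin")

sha256 = hashlib.sha256()
with path.open("rb") as f:
    for block in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB blocks
        sha256.update(block)

# Values copied from the LFS pointer above.
assert path.stat().st_size == 320890944
assert sha256.hexdigest() == "86193d85980d0300e663a2b7e81690c60677accf69db5ccbaf072671a7a55ff5"
print("weight.bin matches its LFS pointer")
```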
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
3
+ size 11422654
tokenizer_config.json ADDED
@@ -0,0 +1,240 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ }
213
+ },
214
+ "additional_special_tokens": [
215
+ "<|im_start|>",
216
+ "<|im_end|>",
217
+ "<|object_ref_start|>",
218
+ "<|object_ref_end|>",
219
+ "<|box_start|>",
220
+ "<|box_end|>",
221
+ "<|quad_start|>",
222
+ "<|quad_end|>",
223
+ "<|vision_start|>",
224
+ "<|vision_end|>",
225
+ "<|vision_pad|>",
226
+ "<|image_pad|>",
227
+ "<|video_pad|>"
228
+ ],
229
+ "bos_token": null,
230
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- '' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" and not message.tool_calls %}\n {%- set content = message.content %}\n {%- if not loop.last %}\n {%- set content = message.content.split('</think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set content = message.content %}\n {%- if not loop.last %}\n {%- set content = message.content.split('</think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
231
+ "clean_up_tokenization_spaces": false,
232
+ "eos_token": "<|im_end|>",
233
+ "errors": "replace",
234
+ "extra_special_tokens": {},
235
+ "model_max_length": 131072,
236
+ "pad_token": "<|endoftext|>",
237
+ "split_special_tokens": false,
238
+ "tokenizer_class": "Qwen2Tokenizer",
239
+ "unk_token": null
240
+ }
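
This `tokenizer_config.json` carries the Qwen chat template and the special tokens the sample inference scripts rely on (`<|im_start|>`/`<|im_end|>` turn markers, `<|endoftext|>` padding). A minimal sketch of exercising it with HuggingFace Transformers; running it from inside the cloned model directory is an assumption:

```python
from transformers import AutoTokenizer

# Assumes the current directory is the cloned model repo.
tokenizer = AutoTokenizer.from_pretrained(".")

messages = [{"role": "user", "content": "Hello, who are you?"}]

# Renders <|im_start|>user ... <|im_end|> plus the assistant header,
# exactly as defined by the chat_template above.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)

# Token ids of the rendered prompt; eos is <|im_end|> (id 151645 per the config).
input_ids = tokenizer(prompt, return_tensors="np").input_ids
print(input_ids.shape, "eos id:", tokenizer.eos_token_id)
```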
vocab.json ADDED
The diff for this file is too large to render.