Upload README.md with huggingface_hub
README.md
CHANGED
@@ -21,34 +21,7 @@ language:
 
 # Running in a mobile app
 (TODO: pte file name generation)
-The [pte file](https://huggingface.co/
-(model): Qwen3Model(
-  (embed_tokens): Embedding(151936, 4096)
-  (layers): ModuleList(
-    (0-35): 36 x Qwen3DecoderLayer(
-      (self_attn): Qwen3Attention(
-        (q_proj): Linear(in_features=4096, out_features=4096, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([4096, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (k_proj): Linear(in_features=4096, out_features=1024, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([1024, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (v_proj): Linear(in_features=4096, out_features=1024, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([1024, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (o_proj): Linear(in_features=4096, out_features=4096, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([4096, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
-        (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
-      )
-      (mlp): Qwen3MLP(
-        (gate_proj): Linear(in_features=4096, out_features=12288, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([12288, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (up_proj): Linear(in_features=4096, out_features=12288, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([12288, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (down_proj): Linear(in_features=12288, out_features=4096, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([4096, 12288]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (act_fn): SiLU()
-      )
-      (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
-      (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
-    )
-  )
-  (norm): Qwen3RMSNorm((4096,), eps=1e-06)
-  (rotary_emb): Qwen3RotaryEmbedding()
-)
-(lm_head): Linear(in_features=4096, out_features=151936, bias=False)
-)/blob/main/qwen3-4B-INT8-INT4-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+The [pte file](https://huggingface.co/jerryzh168/Qwen3-8B-INT8-INT4/blob/main/qwen3-4B-INT8-INT4-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at (to be filled) tokens/sec and uses (to be filled) Mb of memory.
 
 TODO: attach image
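To pull the exported program onto a development machine before sideloading it into the iOS demo app, a minimal sketch using `huggingface-cli` (repo and file name are taken from the link in the hunk above; adjust if the repo layout changes):

```Shell
# Download just the .pte program from the Hub into a local directory.
# Repo and file name copied from the README link above.
huggingface-cli download jerryzh168/Qwen3-8B-INT8-INT4 qwen3-4B-INT8-INT4-1024-cxt.pte --local-dir ./qwen3-pte
```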
@@ -236,7 +209,7 @@ python -m executorch.examples.models.qwen3.convert_weights $(huggingface-cli dow
 Once the checkpoint is converted, we can export to ExecuTorch's pte format with the XNNPACK delegate.
 The below command exports with a max_seq_length/max_context_length of 1024, but it can be changed as desired.
 
-(TODO:
+(TODO: pte file name, model config path, model name auto generation)
 ```Shell
 PARAMS="executorch/examples/models/qwen3/4b_config.json"
 python -m executorch.examples.models.llama.export_llama --model "qwen3-4b" --checkpoint "pytorch_model_converted.bin" --params "$PARAMS" -kv --use_sdpa_with_kv_cache -d fp32
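The updated TODO notes that the pte file name is still hand-maintained. A hedged sketch of deriving it from the chosen context length instead; `--max_seq_length` and `--output_name` are assumed here from the ExecuTorch llama export CLI and should be verified against the installed version:

```Shell
# Hypothetical variant of the export command above: derive the output file
# name from the context length instead of hard-coding it.
CTX=1024  # max_seq_length/max_context_length; change as desired
PARAMS="executorch/examples/models/qwen3/4b_config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "qwen3-4b" \
  --checkpoint "pytorch_model_converted.bin" \
  --params "$PARAMS" \
  -kv --use_sdpa_with_kv_cache -d fp32 \
  --max_seq_length "$CTX" \
  --output_name "qwen3-4B-INT8-INT4-${CTX}-cxt.pte"
```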
|