Upload README.md with huggingface_hub
README.md
CHANGED
@@ -21,34 +21,7 @@ language:
 
 # Running in a mobile app
 (TODO: pte file name generation)
-The [pte file](https://huggingface.co/
-(model): Qwen3Model(
-  (embed_tokens): Embedding(151936, 4096)
-  (layers): ModuleList(
-    (0-35): 36 x Qwen3DecoderLayer(
-      (self_attn): Qwen3Attention(
-        (q_proj): Linear(in_features=4096, out_features=4096, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([4096, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (k_proj): Linear(in_features=4096, out_features=1024, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([1024, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (v_proj): Linear(in_features=4096, out_features=1024, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([1024, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (o_proj): Linear(in_features=4096, out_features=4096, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([4096, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
-        (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
-      )
-      (mlp): Qwen3MLP(
-        (gate_proj): Linear(in_features=4096, out_features=12288, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([12288, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (up_proj): Linear(in_features=4096, out_features=12288, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([12288, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (down_proj): Linear(in_features=12288, out_features=4096, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([4096, 12288]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-        (act_fn): SiLU()
-      )
-      (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
-      (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
-    )
-  )
-  (norm): Qwen3RMSNorm((4096,), eps=1e-06)
-  (rotary_emb): Qwen3RotaryEmbedding()
-)
-(lm_head): Linear(in_features=4096, out_features=151936, bias=False)
-)/blob/main/qwen3-4B-INT8-INT4-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+The [pte file](https://huggingface.co/jerryzh168/Qwen3-8B-INT8-INT4/blob/main/qwen3-4B-INT8-INT4-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at (to be filled) tokens/sec and uses (to be filled) Mb of memory.
 
 TODO: attach image
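To pull the exported program onto a development machine before sideloading it into the iOS demo app, a minimal sketch using `huggingface-cli` (repo and file name are taken from the link in the hunk above; adjust if the repo layout changes):

```Shell
# Download just the .pte program from the Hub into a local directory.
# Repo and file name copied from the README link above.
huggingface-cli download jerryzh168/Qwen3-8B-INT8-INT4 qwen3-4B-INT8-INT4-1024-cxt.pte --local-dir ./qwen3-pte
```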
@@ -236,7 +209,7 @@ python -m executorch.examples.models.qwen3.convert_weights $(huggingface-cli dow
 Once the checkpoint is converted, we can export to ExecuTorch's pte format with the XNNPACK delegate.
 The below command exports with a max_seq_length/max_context_length of 1024, but it can be changed as desired.
 
-(TODO:
+(TODO: pte file name, model config path, model name auto generation)
 ```Shell
 PARAMS="executorch/examples/models/qwen3/4b_config.json"
 python -m executorch.examples.models.llama.export_llama --model "qwen3-4b" --checkpoint "pytorch_model_converted.bin" --params "$PARAMS" -kv --use_sdpa_with_kv_cache -d fp32
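The updated TODO notes that the pte file name is still hand-maintained. A hedged sketch of deriving it from the chosen context length instead; `--max_seq_length` and `--output_name` are assumed here from the ExecuTorch llama export CLI and should be verified against the installed version:

```Shell
# Hypothetical variant of the export command above: derive the output file
# name from the context length instead of hard-coding it.
CTX=1024  # max_seq_length/max_context_length; change as desired
PARAMS="executorch/examples/models/qwen3/4b_config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "qwen3-4b" \
  --checkpoint "pytorch_model_converted.bin" \
  --params "$PARAMS" \
  -kv --use_sdpa_with_kv_cache -d fp32 \
  --max_seq_length "$CTX" \
  --output_name "qwen3-4B-INT8-INT4-${CTX}-cxt.pte"
```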
|