jerryzh168 committed
Commit b5ebbbe · verified · 1 parent: 9f1a340

Upload README.md with huggingface_hub

Files changed (1): README.md (+2, -29)
README.md CHANGED
````diff
@@ -21,34 +21,7 @@ language:
 
 # Running in a mobile app
 (TODO: pte file name generation)
-The [pte file](https://huggingface.co/Qwen3ForCausalLM(
-  (model): Qwen3Model(
-    (embed_tokens): Embedding(151936, 4096)
-    (layers): ModuleList(
-      (0-35): 36 x Qwen3DecoderLayer(
-        (self_attn): Qwen3Attention(
-          (q_proj): Linear(in_features=4096, out_features=4096, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([4096, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-          (k_proj): Linear(in_features=4096, out_features=1024, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([1024, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-          (v_proj): Linear(in_features=4096, out_features=1024, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([1024, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-          (o_proj): Linear(in_features=4096, out_features=4096, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([4096, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
-          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
-        )
-        (mlp): Qwen3MLP(
-          (gate_proj): Linear(in_features=4096, out_features=12288, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([12288, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-          (up_proj): Linear(in_features=4096, out_features=12288, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([12288, 4096]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-          (down_proj): Linear(in_features=12288, out_features=4096, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f940592e3b0>, weight=AffineQuantizedTensor(shape=torch.Size([4096, 12288]), block_size=(1, 32), device=cuda:0, _layout=QDQLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
-          (act_fn): SiLU()
-        )
-        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
-        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
-      )
-    )
-    (norm): Qwen3RMSNorm((4096,), eps=1e-06)
-    (rotary_emb): Qwen3RotaryEmbedding()
-  )
-  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
-)/blob/main/qwen3-4B-INT8-INT4-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+The [pte file](https://huggingface.co/jerryzh168/Qwen3-8B-INT8-INT4/blob/main/qwen3-4B-INT8-INT4-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at (to be filled) tokens/sec and uses (to be filled) Mb of memory.
 
 TODO: attach image
````
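As the hunk above shows, the removed lines were a `print(model)` dump that had been pasted into the markdown link target by mistake; the commit replaces it with the actual file URL. The dump itself documents the quantization scheme: every `Linear` weight is an `AffineQuantizedTensor` with `quant_min=-8, quant_max=7` (the int4 range), `block_size=(1, 32)` (i.e. group size 32), a `QDQLayout` for ExecuTorch lowering, and int8 asymmetric per-token dynamic activation quantization. Below is a minimal sketch of how such a scheme is typically produced with torchao; the config name, its arguments, the import path, and the model id are assumptions based on recent torchao releases, not commands taken from this repo:

```Python
# Hedged sketch (assumed torchao API, not from this repo): reproduce the
# int8-dynamic-activation + int4-grouped-weight scheme visible in the
# removed printout above.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig
from torchao.dtypes import QDQLayout  # assumed import path

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B", torch_dtype=torch.bfloat16, device_map="cuda"
)

# group_size=32 matches block_size=(1, 32); int4 weights give
# quant_min=-8, quant_max=7; activations are quantized dynamically per
# token to int8 (the _int8_asymm_per_token_quant function in the dump);
# QDQLayout keeps quantize/dequantize ops explicit for ExecuTorch lowering.
quantize_(model, Int8DynamicActivationInt4WeightConfig(group_size=32, layout=QDQLayout()))

print(model)  # Linear weights should now print as LinearActivationQuantizedTensor
```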
````diff
@@ -236,7 +209,7 @@ python -m executorch.examples.models.qwen3.convert_weights $(huggingface-cli dow
 Once the checkpoint is converted, we can export to ExecuTorch's pte format with the XNNPACK delegate.
 The below command exports with a max_seq_length/max_context_length of 1024, but it can be changed as desired.
 
-(TODO: these needs to be updated)
+(TODO: pte file name, model config path, model name auto generation)
 ```Shell
 PARAMS="executorch/examples/models/qwen3/4b_config.json"
 python -m executorch.examples.models.llama.export_llama --model "qwen3-4b" --checkpoint "pytorch_model_converted.bin" --params "$PARAMS" -kv --use_sdpa_with_kv_cache -d fp32
````
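Once exported, the pte can be sanity-checked with ExecuTorch's Python runtime bindings before it is wired into the iOS demo app. A short sketch, assuming the `executorch.runtime` API of recent ExecuTorch releases and the pte filename from the link above:

```Python
# Hedged sketch: confirm the exported .pte deserializes and exposes a
# "forward" method before deploying it to a device.
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("qwen3-4B-INT8-INT4-1024-cxt.pte")
print("Program methods:", program.method_names)  # expect 'forward'
method = program.load_method("forward")  # fails here if a required delegate is missing
```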
 