feihu.hf committed · Commit a68211a · Parent(s): cba1e86

update README

README.md CHANGED
@@ -206,7 +206,14 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 #### Step 1: Update Configuration File
 
-
+Download the model and replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
+
+```bash
+export MODELNAME=Qwen3-235B-A22B-Instruct-2507
+huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+```
 
 #### Step 2: Launch Model Server
 
@@ -226,7 +233,7 @@ Then launch the server with Dual Chunk Flash Attention enabled:
 
 ```bash
 VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
-vllm serve
+vllm serve ./Qwen3-235B-A22B-Instruct-2507 \
   --tensor-parallel-size 8 \
   --max-model-len 1010000 \
   --enable-chunked-prefill \
@@ -262,7 +269,7 @@ Launch the server with DCA support:
 
 ```bash
 python3 -m sglang.launch_server \
-  --model-path
+  --model-path ./Qwen3-235B-A22B-Instruct-2507 \
   --context-length 1010000 \
   --mem-frac 0.75 \
   --attention-backend dual_chunk_flash_attn \
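After the Step 1 swap above, a quick way to confirm the new configuration is active is to read `config.json` back. This is a minimal sanity check, assuming `MODELNAME` is still exported from the download step and `python3` is on PATH:

```bash
# Optional check: print the context-length field from the active config.
# max_position_embeddings is a standard field in Hugging Face config.json files.
python3 -c "import json; cfg = json.load(open('${MODELNAME}/config.json')); print(cfg.get('max_position_embeddings'))"
```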
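Once the vLLM server from Step 2 is up, it exposes the OpenAI-compatible API, so a smoke test can go through `/v1/chat/completions`. A minimal sketch, assuming the default port 8000 and that the served model name falls back to the path given to `vllm serve`:

```bash
# Minimal request against the vLLM OpenAI-compatible endpoint.
# Port 8000 and the model name are assumptions; adjust to your launch flags.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-235B-A22B-Instruct-2507",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```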
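The SGLang server exposes the same OpenAI-compatible interface, so the equivalent smoke test only changes the port. A sketch assuming SGLang's default port 30000 (pass `--port` at launch to override):

```bash
# Same request shape against the SGLang server (default port 30000 assumed).
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-235B-A22B-Instruct-2507",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```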