feihu.hf committed · Commit a68211a · Parent(s): cba1e86

update README

README.md CHANGED
@@ -206,7 +206,14 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 #### Step 1: Update Configuration File
 
-
+Download the model and replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
+
+```bash
+export MODELNAME=Qwen3-235B-A22B-Instruct-2507
+huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+```
 
 #### Step 2: Launch Model Server
 
@@ -226,7 +233,7 @@ Then launch the server with Dual Chunk Flash Attention enabled:
 
 ```bash
 VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
-vllm serve
+vllm serve ./Qwen3-235B-A22B-Instruct-2507 \
   --tensor-parallel-size 8 \
   --max-model-len 1010000 \
   --enable-chunked-prefill \
@@ -262,7 +269,7 @@ Launch the server with DCA support:
 
 ```bash
 python3 -m sglang.launch_server \
-  --model-path
+  --model-path ./Qwen3-235B-A22B-Instruct-2507 \
   --context-length 1010000 \
   --mem-frac 0.75 \
   --attention-backend dual_chunk_flash_attn \
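After the Step 1 swap above, a quick way to confirm the new configuration is active is to read `config.json` back. This is a minimal sanity check, assuming `MODELNAME` is still exported from the download step and `python3` is on PATH:

```bash
# Optional check: print the context-length field from the active config.
# max_position_embeddings is a standard field in Hugging Face config.json files.
python3 -c "import json; cfg = json.load(open('${MODELNAME}/config.json')); print(cfg.get('max_position_embeddings'))"
```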
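Once the vLLM server from Step 2 is up, it exposes the OpenAI-compatible API, so a smoke test can go through `/v1/chat/completions`. A minimal sketch, assuming the default port 8000 and that the served model name falls back to the path given to `vllm serve`:

```bash
# Minimal request against the vLLM OpenAI-compatible endpoint.
# Port 8000 and the model name are assumptions; adjust to your launch flags.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-235B-A22B-Instruct-2507",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```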
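The SGLang server exposes the same OpenAI-compatible interface, so the equivalent smoke test only changes the port. A sketch assuming SGLang's default port 30000 (pass `--port` at launch to override):

```bash
# Same request shape against the SGLang server (default port 30000 assumed).
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-235B-A22B-Instruct-2507",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```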