feihu.hf
committed
Commit · cba1e86
1 Parent(s): 1c8249c
update README
README.md CHANGED
@@ -208,7 +208,7 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
-#### Step 2:
+#### Step 2: Launch Model Server
 
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
 
@@ -282,17 +282,17 @@ python3 -m sglang.launch_server \
 
 #### Troubleshooting:
 
-1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
+1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase --mem-fraction-static."
 
     The VRAM reserved for the KV cache is insufficient.
-    - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
-    - SGLang: Consider reducing the ``context-length`` or increasing the ``tp``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.
+    - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size`` and ``gpu_memory_utilization``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
+    - SGLang: Consider reducing the ``context-length`` or increasing the ``tp`` and ``mem-frac``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.
 
 2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
 
     The VRAM reserved for activation weights is insufficient. You can try lowering ``gpu_memory_utilization`` or ``mem-frac``, but be aware that this might reduce the VRAM available for the KV cache.
 
-3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
+3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxx tokens) is longer than the model's context length (xxx tokens)."
 
     The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len`` or ``context-length``.
 
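For reference, the Step 1 instruction shown in the first hunk ("Replace the content of your `config.json` with `config_1m.json`") amounts to a file swap inside the model directory. A minimal sketch, assuming the model has already been downloaded locally; the path below is a placeholder, not taken from this commit:

```bash
# Placeholder path to a local copy of the 1M model; adjust to your setup.
MODEL_DIR=./path/to/Qwen2.5-1M-model

# Keep a backup of the original config, then activate the 1M config
# (length extrapolation + sparse attention settings) as config.json.
cp "$MODEL_DIR/config.json" "$MODEL_DIR/config.json.bak"
cp "$MODEL_DIR/config_1m.json" "$MODEL_DIR/config.json"
```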
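The vLLM knobs named in the troubleshooting items (``max_model_len``, ``tensor_parallel_size``, ``gpu_memory_utilization``, ``max_num_batched_tokens``) are standard `vllm serve` flags. A sketch of where they go when launching the server in Step 2; the model path and numeric values are placeholders rather than the recommended settings from the full README, which this diff does not show:

```bash
# Sketch only: placeholder model path and values; tune per the README guidance.
# --max-model-len          : reduce if the KV cache does not fit
# --tensor-parallel-size   : increase to spread the KV cache across more GPUs
# --gpu-memory-utilization : raise for more KV cache, lower if activations OOM
# --max-num-batched-tokens : reduce as a last resort; prefill gets slower
vllm serve ./path/to/Qwen2.5-1M-model \
  --max-model-len 1010000 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 131072
```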
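Likewise for SGLang, the ``context-length``, ``tp``, ``mem-frac`` (``--mem-fraction-static``), and ``chunked-prefill-size`` options mentioned above are passed to `sglang.launch_server`. Again a sketch with placeholder values, not the exact command from the README:

```bash
# Sketch only: placeholder model path and values; tune per the README guidance.
# --context-length       : reduce if the KV cache does not fit
# --tp                   : increase to spread the KV cache across more GPUs
# --mem-fraction-static  : raise for more KV cache, lower if activations OOM
# --chunked-prefill-size : reduce as a last resort; prefill gets slower
python3 -m sglang.launch_server \
  --model-path ./path/to/Qwen2.5-1M-model \
  --context-length 1010000 \
  --tp 4 \
  --mem-fraction-static 0.8 \
  --chunked-prefill-size 32768 \
  --port 30000
```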