feihu.hf committed on
Commit cba1e86 · 1 Parent(s): 1c8249c

update README

Files changed (1)
README.md +5 -5
README.md CHANGED
@@ -208,7 +208,7 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
-#### Step 2: Start Model Server
+#### Step 2: Launch Model Server
 
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
 
@@ -282,17 +282,17 @@ python3 -m sglang.launch_server \
 
 #### Troubleshooting:
 
-1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
+1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase --mem-fraction-static."
 
    The VRAM reserved for the KV cache is insufficient.
-   - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
-   - SGLang: Consider reducing the ``context-length`` or increasing the ``tp``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.
+   - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size`` and ``gpu_memory_utilization``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
+   - SGLang: Consider reducing the ``context-length`` or increasing the ``tp`` and ``mem-frac``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.
 
 2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
 
   The VRAM reserved for activation weights is insufficient. You can try lowering ``gpu_memory_utilization`` or ``mem-frac``, but be aware that this might reduce the VRAM available for the KV cache.
 
-3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
+3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxx tokens) is longer than the model's context length (xxx tokens)."
 
   The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len`` or ``context-length``.
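
For context, the config swap in the first hunk is just a file replacement inside the local checkpoint directory. A minimal sketch, assuming the weights are already downloaded; `MODEL_DIR` is a placeholder path, not something fixed by this commit:

```bash
# Overwrite config.json with the 1M-context config (length extrapolation +
# sparse attention), keeping a backup of the original.
# MODEL_DIR is a placeholder for wherever the checkpoint lives locally.
MODEL_DIR=path/to/Qwen2.5-1M-checkpoint
cp "$MODEL_DIR/config.json" "$MODEL_DIR/config.json.bak"
cp "$MODEL_DIR/config_1m.json" "$MODEL_DIR/config.json"
```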
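The knobs named in troubleshooting item 1 correspond to the servers' launch arguments, sketched below. The flag names are standard vLLM/SGLang CLI options (`mem-frac` in the README text is shorthand for SGLang's `--mem-fraction-static`), but the numeric values and the checkpoint path are illustrative assumptions, not the launch settings documented elsewhere in this README.

```bash
# vLLM: trade context length against KV-cache headroom. Lower --max-model-len,
# or spread the cache across more GPUs with --tensor-parallel-size and allow
# more VRAM with --gpu-memory-utilization. Values below are examples only.
vllm serve path/to/Qwen2.5-1M-checkpoint \
  --max-model-len 524288 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 32768

# SGLang: the equivalent knobs. Reducing --chunked-prefill-size also lowers
# peak memory during prefill, at the cost of slower inference.
python3 -m sglang.launch_server \
  --model-path path/to/Qwen2.5-1M-checkpoint \
  --context-length 524288 \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 8192
```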