feihu.hf
committed
Commit · cba1e86
1 Parent(s): 1c8249c
update README
README.md CHANGED
@@ -208,7 +208,7 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
-#### Step 2:
+#### Step 2: Launch Model Server
 
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
 
@@ -282,17 +282,17 @@ python3 -m sglang.launch_server \
 
 #### Troubleshooting:
 
-1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
+1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase --mem-fraction-static."
 
     The VRAM reserved for the KV cache is insufficient.
-    - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
-    - SGLang: Consider reducing the ``context-length`` or increasing the ``tp``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.
+    - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size`` and ``gpu_memory_utilization``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
+    - SGLang: Consider reducing the ``context-length`` or increasing the ``tp`` and ``mem-frac``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.
 
 2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
 
     The VRAM reserved for activation weights is insufficient. You can try lowering ``gpu_memory_utilization`` or ``mem-frac``, but be aware that this might reduce the VRAM available for the KV cache.
 
-3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
+3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxx tokens) is longer than the model's context length (xxx tokens)."
 
     The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len`` or ``context-length``.
 
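For reference, the Step 1 instruction shown in the first hunk ("Replace the content of your `config.json` with `config_1m.json`") amounts to a file swap inside the model directory. A minimal sketch, assuming the model has already been downloaded locally; the path below is a placeholder, not taken from this commit:

```bash
# Placeholder path to a local copy of the 1M model; adjust to your setup.
MODEL_DIR=./path/to/Qwen2.5-1M-model

# Keep a backup of the original config, then activate the 1M config
# (length extrapolation + sparse attention settings) as config.json.
cp "$MODEL_DIR/config.json" "$MODEL_DIR/config.json.bak"
cp "$MODEL_DIR/config_1m.json" "$MODEL_DIR/config.json"
```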
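The vLLM knobs named in the troubleshooting items (``max_model_len``, ``tensor_parallel_size``, ``gpu_memory_utilization``, ``max_num_batched_tokens``) are standard `vllm serve` flags. A sketch of where they go when launching the server in Step 2; the model path and numeric values are placeholders rather than the recommended settings from the full README, which this diff does not show:

```bash
# Sketch only: placeholder model path and values; tune per the README guidance.
# --max-model-len          : reduce if the KV cache does not fit
# --tensor-parallel-size   : increase to spread the KV cache across more GPUs
# --gpu-memory-utilization : raise for more KV cache, lower if activations OOM
# --max-num-batched-tokens : reduce as a last resort; prefill gets slower
vllm serve ./path/to/Qwen2.5-1M-model \
  --max-model-len 1010000 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 131072
```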
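Likewise for SGLang, the ``context-length``, ``tp``, ``mem-frac`` (``--mem-fraction-static``), and ``chunked-prefill-size`` options mentioned above are passed to `sglang.launch_server`. Again a sketch with placeholder values, not the exact command from the README:

```bash
# Sketch only: placeholder model path and values; tune per the README guidance.
# --context-length       : reduce if the KV cache does not fit
# --tp                   : increase to spread the KV cache across more GPUs
# --mem-fraction-static  : raise for more KV cache, lower if activations OOM
# --chunked-prefill-size : reduce as a last resort; prefill gets slower
python3 -m sglang.launch_server \
  --model-path ./path/to/Qwen2.5-1M-model \
  --context-length 1010000 \
  --tp 4 \
  --mem-fraction-static 0.8 \
  --chunked-prefill-size 32768 \
  --port 30000
```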