Qwen/Qwen3-32B-FP8

Remove vLLM FP8 Limitation

#3

by simon-mo - opened Apr 29

base: refs/heads/main ← from: refs/pr/3

README.md +0 -23

@@ -113,29 +113,6 @@ You can use the Qwen3-32B-FP8 model with serveral inference frameworks, includin
 However, please pay attention to the following known issues:
 - `transformers`:
     - there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.
-- vLLM:
-    - there are currently compatibility issues with `vllm`. For a quick fix, you should make the following changes to `vllm/vllm/model_executor/layers/linear.py`:
-        ```python
-        # these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
-        ...
-        shard_offset = self._get_shard_offset_mapping(loaded_shard_id)
-        shard_size = self._get_shard_size_mapping(loaded_shard_id)
-        # add the following code
-        if isinstance(param, BlockQuantScaleParameter):
-            weight_block_size = self.quant_method.quant_config.weight_block_size
-            block_n, _ = weight_block_size[0], weight_block_size[1]
-            shard_offset = (shard_offset + block_n - 1) // block_n
-            shard_size = (shard_size + block_n - 1) // block_n
-        # end of the modification
-        param.load_qkv_weight(loaded_weight=loaded_weight,
-                                num_heads=self.num_kv_head_replicas,
-                                shard_id=loaded_shard_id,
-                                shard_offset=shard_offset,
-                                shard_size=shard_size)
-        ...
-        ```
 ## Switching Between Thinking and Non-Thinking Mode

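
For readers reviewing what the removed workaround did: it converted the QKV shard offset and size from element units into block-scale units by ceiling division with the quantization block size. A minimal standalone sketch of that arithmetic follows. The helper name `to_block_units` and the sample values (a block size of 128, the offsets and sizes) are hypothetical and chosen only for illustration; they are not taken from vLLM or from the model's configuration.

```python
from typing import Tuple


def to_block_units(shard_offset: int, shard_size: int, block_n: int) -> Tuple[int, int]:
    """Convert an element offset and size into block-scale units.

    Mirrors the `(x + block_n - 1) // block_n` ceiling division used in
    the removed README snippet. `to_block_units` is a hypothetical helper,
    not a vLLM function.
    """
    if block_n <= 0:
        raise ValueError("block_n must be a positive integer")
    block_offset = (shard_offset + block_n - 1) // block_n
    block_size = (shard_size + block_n - 1) // block_n
    return block_offset, block_size


# Illustrative values only (assumed, not read from any real checkpoint).
aligned = to_block_units(shard_offset=8192, shard_size=1024, block_n=128)
unaligned = to_block_units(shard_offset=100, shard_size=1000, block_n=128)
print(aligned, unaligned)
```

Ceiling division rather than floor division means a shard that only partly fills its last block still gets a scale entry for that block.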