Update README.md
README.md CHANGED
@@ -13,7 +13,7 @@ base_model:
 pipeline_tag: text-generation
 ---

-[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team. Use it directly, or serve using [vLLM](https://docs.vllm.ai/en/latest/) with 47% VRAM reduction, around
+[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team. Use it directly, or serve using [vLLM](https://docs.vllm.ai/en/latest/) with 47% VRAM reduction, around 1.5x speedup and little to no accuracy impact on H100.

 # Inference with vLLM
 ```Shell
@@ -233,8 +233,8 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 | Benchmark (Tested on H100)       |                |                               |
 |----------------------------------|----------------|-------------------------------|
 |                                  | Qwen3-32B      | Qwen3-32B-float8dq            |
-| latency (batch_size=1)           | 9.1s           | 5.77s (
-| latency (batch_size=128)         | 12.45s         | 8.40s (
+| latency (batch_size=1)           | 9.1s           | 5.77s (1.58x speedup)         |
+| latency (batch_size=128)         | 12.45s         | 8.40s (1.48x speedup)         |

 <details>
 <summary> Reproduce latency benchmarks </summary>
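The scheme named in the changed description (float8 dynamic activation quantization plus float8 weight quantization, per-row granularity) can be applied with torchao through the transformers integration. Below is a minimal sketch, assuming the `Float8DynamicActivationFloat8WeightConfig` and `PerRow` names from recent torchao releases; it is an illustration of the technique, not necessarily how this checkpoint was produced.

```python
# Sketch: quantize Qwen3-32B with torchao float8 dynamic activation +
# float8 weight quantization at per-row granularity, via transformers.
# Assumption: config class names from recent torchao releases; older
# releases expose the same scheme under different names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

model_id = "Qwen/Qwen3-32B"

# Per-row scales for both activations and weights, as named in the
# model description above.
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Per-row granularity keeps one float8 scale per output row of each weight matrix, which is what allows the VRAM savings quoted above while limiting the accuracy impact relative to coarser per-tensor scaling.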
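The updated table reports end-to-end latency when the quantized checkpoint is run under vLLM. A minimal offline-inference sketch with vLLM's Python API follows; the repo id `pytorch/Qwen3-32B-float8dq` is an assumption for illustration, since the diff only shows the column label `Qwen3-32B-float8dq`.

```python
# Sketch: offline inference with vLLM against the float8dq checkpoint.
# Assumption: the Hugging Face repo id below is illustrative; substitute
# the actual quantized checkpoint id from this model card.
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Qwen3-32B-float8dq")

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)
outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```

The same model id also works with `vllm serve` for an OpenAI-compatible endpoint, which is the serving path the benchmark table above exercises.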