---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE
pipeline_tag: text-generation
base_model: Qwen/Qwen3-30B-A3B
---

Int8 (W8A8) quant of Qwen/Qwen3-30B-A3B for optimized performance on Ampere GPUs.

# usage with sglang

Currently, upstream sglang doesn't load this quant correctly due to a few minor issues. Until upstream is fixed, a working fork is available at https://github.com/nytopop/sglang/tree/qwen-30b-a3b:

```shell
uv venv --python 3.12

# use patched sglang from git
uv pip install git+https://github.com/nytopop/sglang.git@qwen-30b-a3b#subdirectory=python[all] --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python

# run
uv run python -m sglang.launch_server --model-path nytopop/Qwen3-30B-A3B.w8a8 --quantization w8a8_int8 --reasoning-parser qwen3
```
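
Once the server is up, it can be queried through sglang's OpenAI-compatible API. A minimal sketch, assuming the default bind of `127.0.0.1:30000` and the `openai` client package (both assumptions, not part of this repo):

```python
from openai import OpenAI

# sglang serves an OpenAI-compatible API; the key is unused but required by the client
client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="none")

resp = client.chat.completions.create(
    model="nytopop/Qwen3-30B-A3B.w8a8",
    messages=[{"role": "user", "content": "Briefly explain int8 quantization."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```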

# creation

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

model_id = "Qwen/Qwen3-30B-A3B"
model_out = model_id.split("/")[1] + ".w8a8"

# fit as much of the model as possible on one GPU, spilling the rest to CPU
device_map = calculate_offload_device_map(
    model_id, reserve_for_hessians=False, num_gpus=1, torch_dtype="bfloat16"
)

# the computed map may offload some modules to disk; keep them in CPU RAM instead
for k, v in device_map.items():
    if v == "disk":
        device_map[k] = "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    torch_dtype="bfloat16",
)

# int8 W8A8 on all Linear layers, keeping lm_head and the MoE router gates unquantized
recipe = QuantizationModifier(
    targets="Linear", scheme="W8A8", ignore=["lm_head", "re:.*mlp.gate$"]
)

oneshot(model=model, recipe=recipe, output_dir=model_out)
```
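
The `ignore` list leaves `lm_head` and the MoE router gates (`re:.*mlp.gate$`) in bf16; routers are tiny and tend to be sensitive to quantization error, so keeping them in full precision is common practice for MoE quants. As an optional sanity check (a sketch, not part of the original recipe), you can confirm the export carries a quantization config before serving it:

```python
import json, os

# model_out is the directory written by oneshot() above
with open(os.path.join(model_out, "config.json")) as f:
    cfg = json.load(f)

# llmcompressor records the compressed-tensors scheme in config.json
assert "quantization_config" in cfg
print(cfg["quantization_config"])
```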