Qwen/Qwen3-32B · Providing a GPTQ version

14 days ago

Congratulations on the recent release of your new LLM! I’m very interested in trying it out and was wondering if you could kindly consider providing a GPTQ version of the model. It would be incredibly helpful for those of us looking to run it efficiently on local hardware.

Thank you for your great work and contributions to the open-source community!

lhan877222

14 days ago

Also need an AWQ version.

tassadar81

14 days ago

maybe with int8 quants please!

sandman4

12 days ago

edited 12 days ago

I created GPTQ 4bit version: https://huggingface.co/sandman4/Qwen3-32B-GPTQ-4bit

tassadar81

11 days ago

> I created GPTQ 4bit version: https://huggingface.co/sandman4/Qwen3-32B-GPTQ-4bit

could you provide an int8 version too please?

sandman4

11 days ago

Here you are: https://huggingface.co/sandman4/Qwen3-32B-GPTQ-8bit

dazipe

9 days ago

edited 9 days ago

Hey Sandman4,

Any chance you could release the 8-bit version in FP16 only? The AMD MI60 and MI100 cards do not work with BF16.

sandman4

8 days ago

How about this? https://huggingface.co/sandman4/Qwen3-32B-GPTQ-8bit-float16
I found that with the float16 dtype, a "dump_percent exceeds 1.0" error occurs when calibrating with "en/c4-train.00001-of-01024.json.gz", but not with "en/c4-train.00002-of-01024.json.gz".
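
For anyone who wants to try a different calibration shard: the file names above match the C4 shard naming, and those shards are gzip-compressed JSON Lines with one record per line and a `text` field. Below is a minimal sketch of reading calibration texts from one shard using only the standard library. The function name, `limit`, and the `min_chars` filter are illustrative assumptions, not taken from the script used for these quants.

```python
import gzip
import json
from typing import List


def load_c4_texts(path: str, limit: int = 128, min_chars: int = 1) -> List[str]:
    """Read up to `limit` `text` fields from a C4-style .json.gz shard.

    `min_chars` is a hypothetical filter for dropping very short samples.
    """
    texts: List[str] = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            text = record.get("text", "")
            if len(text) >= min_chars:
                texts.append(text)
            if len(texts) >= limit:
                break  # stop early instead of reading the whole shard
    return texts
```

Swapping the shard is then just a matter of passing a different path, which makes it easy to check whether an error is tied to particular samples.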

dazipe

8 days ago

Hi Sandman4,

Thank you very much for providing it.
However, it does not produce any sensible output. I tried it with the latest vLLM and two different FA implementations, but all I get is gibberish.
Did you try it? Was it working for you?

sandman4

8 days ago

Hmm... I don't have AMD GPUs. I just tried it with 2x RTX 3090, and it seems to work fine.

(venv) $ vllm serve .../Qwen--Qwen3-32B-gptq-8bit-float16 --served-model-name qwen3-32b --tensor-parallel 2 --max-model-len 4096 --max-num-seqs 1
INFO 05-05 13:12:22 [__init__.py:239] Automatically detected platform cuda.                                                                                                                                                                                                      
INFO 05-05 13:12:27 [api_server.py:1043] vLLM API server version 0.8.5                                                                                                                                                                                                           
INFO 05-05 13:12:27 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='.../Qwen--Qwen3-32B-gptq-8bit-float16', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='.../Qwen--Qwen3-32B-gptq-8bit-float16', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=4096, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, 
max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=['qwen3-32b'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=1, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x723271d5df80>)
INFO 05-05 13:12:35 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.                                                                                                                          
INFO 05-05 13:12:35 [gptq_marlin.py:143] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.                                                                                                                                                       
INFO 05-05 13:12:35 [config.py:1770] Defaulting to use mp for distributed inference                                                                                                                                                                                              
INFO 05-05 13:12:35 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.                                                                                                                                                                                
INFO 05-05 13:12:40 [__init__.py:239] Automatically detected platform cuda.                                                                                                                                                                                                      
INFO 05-05 13:12:43 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='.../Qwen--Qwen3-32B-gptq-8bit-float16', speculative_config=None, tokenizer='.../Qwen--Qwen3-32B-gptq-8bit-float16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=qwen3-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
...

$ curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [
      {
        "role": "user",
        "content": "Why sky is blue?"
      }
    ],
    "max_tokens": 2047
  }' \
  http://...:8000/v1/chat/completions
  {"id":"chatcmpl-9827dd0448dc41ce9feea277ee23f281","object":"chat.completion","created":1746451287,"model":"qwen3-32b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, so I need to figure out why the sky is blue. I remember learning something about light and the atmosphere, but I'm a bit fuzzy on the details. Let me start by recalling what I know. The sun emits white light, which is made up of different colors. Each color has a different wavelength. Blue light has a shorter wavelength compared to red or yellow. But how does that relate to the sky's color?\n\nMaybe it has to do with how light interacts with the atmosphere. I think there's a term like Rayleigh scattering involved. Rayleigh scattering is when light is scattered by particles much smaller than the wavelength of the light. The atmosphere has molecules of nitrogen and oxygen, which are much smaller than the wavelengths of visible light. So shorter wavelengths (blue/violet) are scattered more than the longer ones (red/orange). \n\nWait, but if blue is scattered more, why isn't the sky violet? Violet has an even shorter wavelength. Maybe the sun emits less violet light, or our eyes are more sensitive to blue. Also, the atmosphere might absorb some violet light. I think our eyes have cones that are more sensitive to blue, so we perceive the sky as blue instead of violet.\n\nAnother thing to consider is the direction of the scattered light. During the day, when the sun is high, the blue light is scattered in all directions, making the sky appear blue from all angles. But during sunrise or sunset, the light has to pass through more of the atmosphere, so the shorter wavelengths are scattered out of the line of sight, leaving the longer wavelengths (reds and oranges) to dominate. That's why the sky changes color during those times.\n\nBut wait, what about other particles in the atmosphere, like dust or water droplets? 
Those are bigger than the molecules, so they cause different types of scattering, like Mie scattering, which scatters all wavelengths more equally, leading to white or gray skies when there's a lot of pollution or clouds. So the main reason the sky is blue is due to Rayleigh scattering by the gas molecules in the atmosphere.\n\nLet me check if I missed anything. The key points are: different wavelengths of light, Rayleigh scattering, the role of atmospheric molecules, human eye sensitivity, and the scattering direction. Also, why it's not violet. Maybe the sun's spectrum peaks in the green, but blue is scattered more. Also, some of the violet is absorbed by the upper atmosphere. \n\nSo putting it all together, the blue color of the sky is due to Rayleigh scattering, where shorter wavelengths (blue) are scattered more by the atmosphere, and our eyes are more sensitive to blue than violet, making the sky appear blue.\n</think>\n\nThe sky appears blue primarily due to a phenomenon called **Rayleigh scattering**, which explains how light interacts with particles in the Earth's atmosphere. Here's a concise breakdown of the key reasons:\n\n1. **Sunlight and Wavelengths**:  \n   Sunlight (white light) consists of a spectrum of colors, each with different wavelengths. Blue and violet light have shorter wavelengths, while red and orange have longer ones.\n\n2. **Rayleigh Scattering**:  \n   When sunlight enters Earth's atmosphere, it collides with molecules like nitrogen and oxygen. These molecules are much smaller than the wavelength of visible light. Rayleigh scattering states that **shorter wavelengths (blue/violet) are scattered more efficiently** than longer ones. This means blue light is scattered in all directions more than other colors.\n\n3. **Why Not Violet?**:  \n   Although violet light is scattered even more than blue, the sky doesn’t appear violet for two reasons:  \n   - **Solar Spectrum**: The sun emits less violet light compared to blue.  
\n   - **Human Eye Sensitivity**: Our eyes are more sensitive to blue light and less sensitive to violet. Additionally, some violet light is absorbed by the upper atmosphere.\n\n4. **Direction of Scattering**:  \n   During the day, when the sun is overhead, the scattered blue light fills the sky from all directions, making it appear blue. At sunrise or sunset, sunlight passes through a thicker layer of atmosphere, scattering out blue light and leaving longer wavelengths (reds and oranges) to dominate.\n\n5. **Other Scattering Types**:  \n   Larger particles (e.g., dust, water droplets) cause **Mie scattering**, which scatters all wavelengths more evenly, leading to white or gray skies (e.g., during cloudy or polluted conditions).\n\n**In Summary**:  \nThe sky is blue because Earth's atmosphere scatters shorter (blue) wavelengths of sunlight more effectively than longer wavelengths, combined with human eye sensitivity. This scattering is governed by Rayleigh scattering, a principle that depends on the size of atmospheric particles and the wavelength of light.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":13,"total_tokens":984,"completion_tokens":971,"prompt_tokens_details":null},"prompt_logprobs":null}
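
Note that the response above has "reasoning_content": null and the model's reasoning appears inline in "content" between <think> and </think>. A rough sketch of building the same request body in Python and splitting the reasoning from the final answer on the client side follows. Both helper names are hypothetical and the split assumes at most one <think> block at the start of the content.

```python
import json
from typing import Optional, Tuple


def build_chat_request(prompt: str, model: str = "qwen3-32b", max_tokens: int = 2047) -> bytes:
    """Build the JSON body used in the curl call above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode("utf-8")


def split_think(content: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer); reasoning is None if no </think> tag is present."""
    head, sep, tail = content.partition("</think>")
    if not sep:
        return None, content.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()
```

A coherent answer after the closing tag is a quick sanity check when comparing a quant across different GPUs.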

sandman4

8 days ago

edited 8 days ago

@dazipe Have you already checked the official AMD article? https://rocm.blogs.amd.com/artificial-intelligence/qwen3-day0-amd/README.html
VLLM_USE_TRITON_FLASH_ATTN=0 might be needed.
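
A small sketch of how that variable could be combined with the serve command from my earlier post, in case it helps to script the comparison. The function name and the model path argument are placeholders; adding `--dtype float16` is my assumption for forcing FP16 on cards without BF16, and I have not tested this on ROCm.

```python
import os
from typing import Dict, List, Tuple


def build_vllm_launch(
    model_path: str,
    tensor_parallel: int = 2,
    use_triton_fa: bool = False,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for launching `vllm serve`; nothing is executed here."""
    env = dict(os.environ)
    # Toggle the Triton flash-attention path mentioned above.
    env["VLLM_USE_TRITON_FLASH_ATTN"] = "1" if use_triton_fa else "0"
    cmd = [
        "vllm", "serve", model_path,
        "--served-model-name", "qwen3-32b",
        "--tensor-parallel-size", str(tensor_parallel),
        "--max-model-len", "4096",
        "--max-num-seqs", "1",
        "--dtype", "float16",  # assumption: force FP16 explicitly
    ]
    return cmd, env
```

The result can be passed to `subprocess.run(cmd, env=env)` once for each setting of `use_triton_fa` to compare outputs.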

dazipe

8 days ago

Yes, I tried it both with VLLM_USE_TRITON_FLASH_ATTN=0 and without it. It makes no difference.
Officially, AMD supports the MI210 and newer, and those cards support BF16. However, I have the older MI100, and vLLM only supports FP16 on it. So the official AMD Docker images do not work for me, and I have to build everything myself.
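
To make the FP16/BF16 difference concrete: both are 16-bit formats, but FP16 has 5 exponent bits and 10 mantissa bits, while BF16 has 8 exponent bits and 7 mantissa bits, so BF16 trades precision for a much wider range. The sketch below emulates both with the standard library (the helper names are made up for illustration, and it handles finite inputs only). It does not explain the gibberish output by itself; it only shows the numeric trade-off between the two dtypes.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; raises OverflowError beyond its range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bfloat16: keep the top 16 bits of the float32 encoding (round to nearest even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_fp16(x: float) -> bool:
    """True if x can be represented in FP16 without overflowing."""
    try:
        to_fp16(x)
        return True
    except OverflowError:
        return False
```

For example, `fits_fp16(70000.0)` is false because FP16 tops out at 65504, while `to_bf16(70000.0)` still returns a finite (but coarser) value.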

I have FP16 quants for older LLMs (Qwen2.5-72B, Llama3.3...) and they all work perfectly fine.

I can start vLLM with the quant you provided and it responds to requests, but it produces incoherent output.
I had this problem before with older LLMs, and I believe it was related to the quality of the quants. Or maybe the software is just not there yet.

I did try your script to see if I could make the quants myself... No way. I only have 64GB of VRAM in total, and it runs out of memory before it can do anything.

sandman4

7 days ago

An RTX 3090 has only 24GB of VRAM, so I used 48GB in total. Your 64GB of VRAM sounds like enough for quantization. ROCm support may not be as mature as CUDA's, though.
I think you can also try decreasing batch_size (2 -> 1) and the dataset sequence length (1024 -> 512 or 256) to avoid OOM.
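
Roughly what I mean, as a library-independent sketch: truncate each tokenized calibration sample and feed fewer samples per step, so the number of tokens processed at once drops. The function names and defaults here are hypothetical and do not come from my quantization script. Treat tokens per step as a rough proxy for activation memory only.

```python
from typing import Iterator, List, Sequence


def tokens_per_batch(batch_size: int, max_seq_len: int) -> int:
    """Upper bound on tokens processed in one calibration step."""
    return batch_size * max_seq_len


def make_calibration_batches(
    token_ids: Sequence[Sequence[int]],
    max_seq_len: int = 512,
    batch_size: int = 1,
) -> Iterator[List[List[int]]]:
    """Truncate each sample to `max_seq_len` tokens and group into batches."""
    if max_seq_len < 1 or batch_size < 1:
        raise ValueError("max_seq_len and batch_size must be positive")
    batch: List[List[int]] = []
    for ids in token_ids:
        batch.append(list(ids[:max_seq_len]))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

Going from batch_size 2 with 1024 tokens to batch_size 1 with 512 tokens cuts the per-step token bound from 2048 to 512.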