---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---
# DeepSeek-R1-Distill-Qwen-32B-NVFP4
## Model Overview
- **Model Architecture:** DeepSeek-R1-Distill-Qwen-32B
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Release Date:** 7/30/25
- **Version:** 1.0
- **Model Developers:** RedHatAI
This model is a quantized version of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
### Model Optimizations
This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) to the FP4 data type, ready for inference with vLLM >= 0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
Only the weights and activations of the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
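As a back-of-the-envelope illustration of that reduction (a rough sketch; the parameter count and byte figures are approximations and ignore quantization scales, embeddings, and layers kept in higher precision):
```python
# Rough, illustrative estimate of weight storage before and after 4-bit quantization.
# Real checkpoints also include quantization scales, embeddings, and unquantized layers,
# so actual sizes will differ.
params = 32e9                    # ~32B parameters

bf16_gb = params * 2 / 1e9       # 16 bits = 2 bytes per parameter  -> ~64 GB
nvfp4_gb = params * 0.5 / 1e9    # 4 bits = 0.5 bytes per parameter -> ~16 GB

print(f"BF16 weights : ~{bf16_gb:.0f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:.0f} GB  (~{100 * (1 - nvfp4_gb / bf16_gb):.0f}% smaller)")
```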
## Deployment
### Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
**Model Usage Code**
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
number_gpus = 2

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# Build a chat prompt with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Load the model across the available GPUs and generate.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
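vLLM also supports OpenAI-compatible serving; see the [vLLM documentation](https://docs.vllm.ai/en/latest/) for details. The snippet below is a minimal sketch of querying such a server with the `openai` client, assuming the model has been started with `vllm serve RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4 --tensor-parallel-size 2` and is listening on the default port; adjust the URL and settings for your deployment.
```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (vLLM's default port is 8000;
# the API key is unused unless the server was started with one).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```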
**Model Creation Code**
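The exact recipe used to produce this checkpoint is not included here; the following is a minimal sketch of an NVFP4 quantization run with [LLM Compressor](https://github.com/vllm-project/llm-compressor), assuming a recent llm-compressor release that provides the `NVFP4` scheme. The calibration dataset, sample count, and sequence length shown are illustrative choices, not necessarily the ones used for this model.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
SAVE_DIR = "DeepSeek-R1-Distill-Qwen-32B-NVFP4"

NUM_CALIBRATION_SAMPLES = 512   # illustrative calibration size
MAX_SEQUENCE_LENGTH = 2048      # illustrative sequence length

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Small calibration set used to fit the activation quantization scales
# (illustrative dataset choice).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Quantize the linear layers inside the transformer blocks to NVFP4,
# keeping lm_head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed checkpoint.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```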
## Evaluation
The table below compares the quantized model against the unquantized baseline; recovery is the quantized score as a percentage of the baseline score.

| Category | Metric | DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1-Distill-Qwen-32B-NVFP4 | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | ARC Challenge | 67.66 | 64.25 | 94.94% |
| | GSM8K | 83.02 | 84.84 | 102.19% |
| | Hellaswag | 83.79 | 83.28 | 99.39% |
| | MMLU | 81.25 | 80.79 | 99.43% |
| | TruthfulQA-mc2 | 58.37 | 57.50 | 98.51% |
| | Winogrande | 75.77 | 76.40 | 100.83% |
| | Average | 74.98 | 74.51 | 99.38% |
| OpenLLM V2 | MMLU-Pro | – | – | – |
| | IFEval | – | – | – |
| | BBH | – | – | – |
| | Math-Hard | – | – | – |
| | GPQA | – | – | – |
| | MuSR | – | – | – |
| | Average | – | – | – |
| Reasoning | Math 500 | 95.09 | 95.60 | 100.54% |
| | GPQA (diamond) | 64.05 | 61.11 | 95.41% |
| | AIME25 | 69.75 (AIME24) | 53.33 | 76.45% |
| | LCB: Code Generation | – | 54.29 | – |
| Coding | HumanEval Instruct pass@1 | – | – | – |
| | HumanEval 64 Instruct pass@2 | – | – | – |
| | HumanEval 64 Instruct pass@8 | – | – | – |
| | HumanEval 64 Instruct pass@16 | – | – | – |
| | HumanEval 64 Instruct pass@32 | – | – | – |
| | HumanEval 64 Instruct pass@64 | – | – | – |