---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---

# DeepSeek-R1-Distill-Qwen-32B-NVFP4

## Model Overview
- **Model Architecture:** DeepSeek-R1-Distill-Qwen-32B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Release Date:** 7/30/25
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) to the FP4 data type, ready for inference with vLLM>=0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights and activations of the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
Model Usage Code

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string using the model's chat template.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across the available GPUs.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
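When choosing `tensor_parallel_size`, it can help to estimate how much GPU memory the weights alone occupy. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes a nominal 32 billion parameters, treats every parameter as quantized (in practice only the linear layers inside the transformer blocks are), and ignores quantization scales, the KV cache, and activations. The function name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count suggested by the "32B" in the model name.
NUM_PARAMS = 32e9

bf16_gb = weight_footprint_gb(NUM_PARAMS, 16)  # 16 bits per parameter
fp4_gb = weight_footprint_gb(NUM_PARAMS, 4)    # 4 bits per parameter
reduction = 1 - fp4_gb / bf16_gb

print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"4-bit weights:  ~{fp4_gb:.0f} GB")
print(f"Reduction:      ~{reduction:.0%}")
```

The real footprint is somewhat larger than the 4-bit estimate, because unquantized layers and the quantization scales add overhead.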
vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

## Creation

This model was created by applying [LLM Compressor with calibration samples from neuralmagic/calibration dataset](https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/llama4_example.py), as presented in the code snippet below.
Model Creation Code

```python
```
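As a conceptual illustration of what 4-bit floating-point quantization does to a block of weights, the toy sketch below rounds each value to the nearest FP4 (E2M1) magnitude after dividing by a shared per-block scale. This is not the LLM Compressor implementation and not the script used to create this model. It assumes the E2M1 value set listed in the code and uses a plain Python float as the scale, whereas NVFP4, as far as we understand it, stores a compact scale per small block of values. All names (`quantize_block`, `dequantize_block`) are hypothetical.

```python
# Magnitudes representable by a 4-bit float with 1 sign, 2 exponent, and 1 mantissa bit (E2M1).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Return (scale, values) where each value is a signed FP4 magnitude."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest FP4 value (6.0).
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    quantized = []
    for x in block:
        target = abs(x) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        quantized.append(nearest if x >= 0 else -nearest)
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate weights from the FP4 values and the shared scale."""
    return [scale * q for q in quantized]


weights = [0.02, -0.11, 0.30, -0.60, 0.05, 0.45, -0.24, 0.0]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)

for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

Small values lose the most relative precision, which is why calibration data and per-block scales matter in practice.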
## Evaluation

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval_64 benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).

| Category | Metric | DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1-Distill-Qwen-32B-NVFP4 | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | ARC Challenge | 67.66 | 64.25 | 94.94% |
| | GSM8K | 83.02 | 84.84 | 102.19% |
| | Hellaswag | 83.79 | 83.28 | 99.39% |
| | MMLU | 81.25 | 80.79 | 99.43% |
| | TruthfulQA-mc2 | 58.37 | 57.50 | 98.51% |
| | Winogrande | 75.77 | 76.40 | 100.83% |
| | Average | 74.98 | 74.51 | 99.38% |
| OpenLLM V2 | MMLU-Pro | | | |
| | IFEval | | | |
| | BBH | | | |
| | Math-Hard | | | |
| | GPQA | | | |
| | MuSR | | | |
| | Average | | | |
| Reasoning | Math 500 | 95.09 | 95.60 | 100.54% |
| | GPQA (diamond) | 64.05 | 61.11 | 95.41% |
| | AIME25 | 69.75 (AIME24) | 53.33 | 76.45% |
| | LCB: Code Generation | 54.29 | | |
| Coding | HumanEval Instruct pass@1 | | | |
| | HumanEval 64 Instruct pass@2 | | | |
| | HumanEval 64 Instruct pass@8 | | | |
| | HumanEval 64 Instruct pass@16 | | | |
| | HumanEval 64 Instruct pass@32 | | | |
| | HumanEval 64 Instruct pass@64 | | | |

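The Recovery column appears to be the quantized model's score expressed as a percentage of the baseline score. That definition is inferred from the reported numbers rather than stated explicitly, so treat the sketch below as an assumption. The helper name `recovery_percent` is hypothetical, and the scores are copied from the OpenLLM V1 rows of the results table.

```python
def recovery_percent(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


# (baseline, quantized) pairs taken from the OpenLLM V1 results.
scores = {
    "GSM8K": (83.02, 84.84),
    "Hellaswag": (83.79, 83.28),
    "MMLU": (81.25, 80.79),
}

for metric, (baseline, quantized) in scores.items():
    print(f"{metric}: {recovery_percent(baseline, quantized):.2f}%")
```

Recomputing from the rounded scores can differ from the published column in the last digit, which would be consistent with the column having been derived from unrounded results.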
### Reproduction

The results were obtained using the following commands:
Model Evaluation Commands

#### OpenLLM v1
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

#### OpenLLM v2
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=15000,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --batch_size auto
```

#### HumanEval and HumanEval_64
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_instruct \
  --batch_size auto

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_64_instruct \
  --batch_size auto
```
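The commands above differ only in the task name and `max_model_len`, so they can be generated from one template. The following is a minimal sketch that only assembles and prints the argument lists; it does not run any evaluation. The helper `build_lm_eval_command` and the `RUNS` list are hypothetical names, and every flag is copied from the commands above.

```python
import shlex

MODEL_ID = "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"


def build_lm_eval_command(task: str, max_model_len: int, tensor_parallel_size: int = 2) -> list[str]:
    """Assemble the lm_eval argument list for one benchmark run."""
    model_args = ",".join([
        f"pretrained={MODEL_ID}",
        "dtype=auto",
        f"max_model_len={max_model_len}",
        f"tensor_parallel_size={tensor_parallel_size}",
        "enable_chunked_prefill=True",
        "enforce_eager=True",
    ])
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--apply_chat_template",
        "--fewshot_as_multiturn",
        "--tasks", task,
        "--batch_size", "auto",
    ]


# (task, max_model_len) pairs as used in the reproduction commands.
RUNS = [
    ("openllm", 4096),
    ("leaderboard", 15000),
    ("humaneval_instruct", 4096),
    ("humaneval_64_instruct", 4096),
]

for task, max_len in RUNS:
    print(shlex.join(build_lm_eval_command(task, max_len)))
```

To execute a run, the returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with `lm_eval` and vLLM installed.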