---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---

# DeepSeek-R1-Distill-Qwen-32B-NVFP4

## Model Overview
- **Model Architecture:** DeepSeek-R1-Distill-Qwen-32B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Release Date:** 7/30/25
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) to the FP4 data type, ready for inference with vLLM >= 0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights and activations of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
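
As a rough back-of-the-envelope check of the memory savings (a sketch; the ~32.8B parameter count is approximate, and the small per-block NVFP4 scale overhead is ignored):

```python
# Approximate weight-storage footprint before and after FP4 quantization.
NUM_PARAMS = 32.8e9  # approximate parameter count of the 32B model

bf16_gb = NUM_PARAMS * 2 / 1e9    # 2 bytes per parameter -> ~66 GB
fp4_gb = NUM_PARAMS * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter -> ~16 GB

# NVFP4 additionally stores a shared scale per small block of weights,
# which adds a modest overhead on top of the 4-bit values.
print(f"BF16: ~{bf16_gb:.0f} GB, FP4: ~{fp4_gb:.0f} GB "
      f"({100 * (1 - fp4_gb / bf16_gb):.0f}% smaller)")
```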

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

<details>
<summary>Model Usage Code</summary>

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string using the model's chat template.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across GPUs via tensor parallelism.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
</details>

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
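
For example, after starting a server with `vllm serve RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4 --tensor-parallel-size 2`, the endpoint can be queried with the standard OpenAI client (a minimal sketch; the port and API key shown are the vLLM defaults):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default;
# the API key is unused unless the server is started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0.6,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```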

## Creation

This model was created by applying [LLM Compressor with calibration samples from the neuralmagic/calibration dataset](https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/llama4_example.py), as presented in the code snippet below. The snippet is a representative NVFP4 recipe; the exact calibration settings (sample count, sequence length) are assumptions.

<details>
<summary>Model Creation Code</summary>

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data; the sample count and sequence length are assumptions.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
ds = load_dataset("neuralmagic/calibration", name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")

def preprocess(example):
    # Flatten each chat transcript into a single text string.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Quantize the weights and activations of all Linear layers to NVFP4,
# keeping the lm_head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model and tokenizer.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
</details>

## Evaluation

This model was evaluated on the OpenLLM v1 and OpenLLM v2 benchmark suites, a set of reasoning tasks, and the HumanEval and HumanEval_64 coding benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>DeepSeek-R1-Distill-Qwen-32B</th>
      <th>DeepSeek-R1-Distill-Qwen-32B-NVFP4</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="7"><b>OpenLLM V1</b></td>
      <td>ARC Challenge</td>
      <td>67.66</td>
      <td>64.25</td>
      <td>94.94%</td>
    </tr>
    <tr>
      <td>GSM8K</td>
      <td>83.02</td>
      <td>84.84</td>
      <td>102.19%</td>
    </tr>
    <tr>
      <td>Hellaswag</td>
      <td>83.79</td>
      <td>83.28</td>
      <td>99.39%</td>
    </tr>
    <tr>
      <td>MMLU</td>
      <td>81.25</td>
      <td>80.79</td>
      <td>99.43%</td>
    </tr>
    <tr>
      <td>TruthfulQA-mc2</td>
      <td>58.37</td>
      <td>57.50</td>
      <td>98.51%</td>
    </tr>
    <tr>
      <td>Winogrande</td>
      <td>75.77</td>
      <td>76.40</td>
      <td>100.83%</td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td><b>74.98</b></td>
      <td><b>74.51</b></td>
      <td><b>99.38%</b></td>
    </tr>
    <tr>
      <td rowspan="7"><b>OpenLLM V2</b></td>
      <td>MMLU-Pro</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>IFEval</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>BBH</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>Math-Hard</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>GPQA</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>MuSR</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td><b></b></td>
      <td><b></b></td>
      <td><b></b></td>
    </tr>
    <tr>
      <td rowspan="4"><b>Reasoning</b></td>
      <td>Math 500</td>
      <td>95.09</td>
      <td>95.60</td>
      <td>100.54%</td>
    </tr>
    <tr>
      <td>GPQA (diamond)</td>
      <td>64.05</td>
      <td>61.11</td>
      <td>95.41%</td>
    </tr>
    <tr>
      <td>AIME25</td>
      <td>69.75 (AIME24)</td>
      <td>53.33</td>
      <td>76.45%</td>
    </tr>
    <tr>
      <td>LCB: Code Generation</td>
      <td>–</td>
      <td>54.29</td>
      <td>–</td>
    </tr>
    <tr>
      <td rowspan="6"><b>Coding</b></td>
      <td>HumanEval Instruct pass@1</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@2</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@8</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@16</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@32</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@64</td>
      <td>–</td>
      <td>–</td>
      <td>–</td>
    </tr>
  </tbody>
</table>
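
Recovery is the quantized score expressed as a percentage of the baseline score. A quick sketch of the computation, using the GSM8K row above:

```python
def recovery(quantized: float, baseline: float) -> float:
    """Recovery (%) as reported in the table: quantized / baseline * 100."""
    return 100.0 * quantized / baseline

# GSM8K: 84.84 (NVFP4) vs. 83.02 (baseline) -> ~102.19%
print(f"{recovery(84.84, 83.02):.2f}%")
```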

### Reproduction

The results were obtained using the following commands:

<details>
<summary>Model Evaluation Commands</summary>

#### OpenLLM v1
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

#### OpenLLM v2
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=15000,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --batch_size auto
```

#### HumanEval and HumanEval_64
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_instruct \
  --batch_size auto

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_64_instruct \
  --batch_size auto
```
</details>