---
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
library_name: vllm
base_model:
- mistralai/Mistral-Small-3.1-24B-Instruct-2503
pipeline_tag: image-text-to-text
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- int8
---

# Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8

## Model Overview
- **Model Architecture:** Mistral3ForConditionalGeneration
  - **Input:** Text / Image
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** INT8
  - **Weight quantization:** INT8
- **Intended Use Cases:** It is ideal for:
  - Fast-response conversational agents.
  - Low-latency function calling.
  - Subject matter experts via fine-tuning.
  - Local inference for hobbyists and organizations handling sensitive data.
  - Programming and math reasoning.
  - Long document understanding.
  - Visual understanding.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages not officially supported by the model.
- **Release Date:** 04/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)


### Model Optimizations

This model was obtained by quantizing activations and weights of [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) to INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.
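
The roughly 50% figure follows from simple arithmetic on the bit width. The sketch below is a back-of-the-envelope estimate only: the parameter count of 24e9 is an assumption read off the "24B" in the model name, and it treats every weight as quantized, which slightly overstates the saving because the vision tower, multimodal projector, and `lm_head` are kept in their original precision.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: ~24e9 parameters, taken from the "24B" in the model name.
NUM_PARAMS = 24e9

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
int8_gib = weight_memory_gib(NUM_PARAMS, 8)
savings = 1 - int8_gib / bf16_gib

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"INT8 weights:   {int8_gib:.1f} GiB")
print(f"Reduction:      {savings:.0%}")
```

Actual GPU usage also includes the KV cache, activations, and quantization scales, so measured savings will differ from this estimate.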

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](https://arxiv.org/abs/2210.17323) algorithms is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
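
To make the two schemes concrete, here is a minimal NumPy sketch of symmetric per-channel weight quantization and symmetric per-token activation quantization, followed by an INT8 matrix multiply with rescaling. It is illustrative only: it uses plain round-to-nearest rather than GPTQ's error-compensating rounding, the function names are hypothetical, and it does not reflect the actual kernels used by vLLM.

```python
import numpy as np

INT8_MAX = 127  # symmetric range [-127, 127], zero-point fixed at 0


def quantize_rows(t: np.ndarray):
    """Symmetric INT8 quantization with one scale per row."""
    scale = np.abs(t).max(axis=1, keepdims=True) / INT8_MAX
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(t / scale), -INT8_MAX, INT8_MAX).astype(np.int8)
    return q, scale


rng = np.random.default_rng(0)
w = rng.normal(size=(32, 64))  # (out_features, in_features)
x = rng.normal(size=(4, 64))   # (tokens, in_features)

# Static per-channel: weight scales are computed once, offline,
# with one scale per output channel (row of w).
q_w, s_w = quantize_rows(w)

# Dynamic per-token: activation scales are computed at run time,
# with one scale per token (row of x).
q_x, s_x = quantize_rows(x)

# INT8 matmul with INT32 accumulation, then rescale back to floating point.
acc = q_x.astype(np.int32) @ q_w.astype(np.int32).T
y_quant = acc * s_x * s_w.T

y_ref = x @ w.T
rel_err = np.linalg.norm(y_quant - y_ref) / np.linalg.norm(y_ref)
print(f"relative error: {rel_err:.4f}")
```

Because both schemes are symmetric, no zero-point terms appear in the rescaling step; the output is simply the integer accumulator multiplied by the token scale and the channel scale.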


## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

prompts = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
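
As a rough illustration of OpenAI-compatible serving, the sketch below builds a chat completion request using only the Python standard library. It assumes a server has already been started with `vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8` and is reachable at `http://localhost:8000`; the host, port, and helper function names are assumptions for this example, not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"
BASE_URL = "http://localhost:8000/v1"  # assumption: local vLLM server, default port


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Give me a short introduction to large language models.")` should return the model's reply as a string. The sampling values mirror those in the offline example above.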

## Creation

<details>
  <summary>Creation details</summary>
  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below. 


  ```python
  from transformers import AutoProcessor
  from llmcompressor.modifiers.quantization import GPTQModifier
  from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
  from llmcompressor.transformers import oneshot
  from llmcompressor.transformers.tracing import TraceableMistral3ForConditionalGeneration
  from datasets import load_dataset, interleave_datasets
  from PIL import Image
  import io
  
  # Load model
  model_stub = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
  model_name = model_stub.split("/")[-1]
  
  num_text_samples = 1024
  num_vision_samples = 1024
  max_seq_len = 8192
  
  processor = AutoProcessor.from_pretrained(model_stub)
  
  model = TraceableMistral3ForConditionalGeneration.from_pretrained(
      model_stub,
      device_map="auto",
      torch_dtype="auto",
  )

  # Text-only data subset
  def preprocess_text(example):
      input = {
          "text": processor.apply_chat_template(
              example["messages"],
              add_generation_prompt=False,
          ),
          "images": None,
      }
      tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
      tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
      tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
      return tokenized_input

  dst = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_text_samples))
  dst = dst.map(preprocess_text, remove_columns=dst.column_names)

  # Text + vision data subset
  def preprocess_vision(example):
      messages = []
      image = None
      for message in example["messages"]:
          message_content = []
          for content in message["content"]:
              if content["type"] == "text":
                  message_content.append({"type": "text", "text": content["text"]})
              else:
                  message_content.append({"type": "image"})
                  image = Image.open(io.BytesIO(content["image"]))

          messages.append(
              {
                  "role": message["role"],
                  "content": message_content,
              }
          )

      input = {
          "text": processor.apply_chat_template(
              messages,
              add_generation_prompt=False,
          ),
          "images": image,
      }
      tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
      tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
      tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
      return tokenized_input

  dsv = load_dataset("neuralmagic/calibration", name="VLM", split="train").select(range(num_vision_samples))
  dsv = dsv.map(preprocess_vision, remove_columns=dsv.column_names)

  # Interleave subsets
  ds = interleave_datasets((dsv, dst))

  # Configure the quantization algorithm and scheme
  recipe = [
      SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
      ),
      GPTQModifier(
          ignore=["language_model.lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
          sequential_targets=["MistralDecoderLayer"],
          dampening_frac=0.01,
          targets="Linear",
          scheme="W8A8",
      ),
  ]

  # Define data collator
  def data_collator(batch):
      import torch
      assert len(batch) == 1
      collated = {}
      for k, v in batch[0].items():
          if v is None:
              continue
          if k == "input_ids":
              collated[k] = torch.LongTensor(v)
          elif k == "pixel_values":
              collated[k] = torch.tensor(v, dtype=torch.bfloat16)
          else:
              collated[k] = torch.tensor(v)
      return collated


  # Apply quantization
  oneshot(
      model=model,
      dataset=ds, 
      recipe=recipe,
      max_seq_length=max_seq_len,
      data_collator=data_collator,
      num_calibration_samples=num_text_samples + num_vision_samples,
  )
  
  # Save to disk in compressed-tensors format
  save_path = model_name + "-quantized.w8a8"
  model.save_pretrained(save_path)
  processor.save_pretrained(save_path)
  print(f"Model and tokenizer saved to: {save_path}")
  ```
</details>
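
The recipe applies SmoothQuant with `smoothing_strength=0.8` before GPTQ. As a minimal sketch of the idea from the SmoothQuant paper (not the llm-compressor implementation), each input channel is divided by a scale on the activation side and multiplied by the same scale on the weight side, which moves outlier magnitude from activations into weights while leaving the layer output unchanged. The function name and synthetic data below are hypothetical.

```python
import numpy as np


def smoothquant_scales(x, w, alpha=0.8, eps=1e-8):
    """Per-input-channel scales: max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    act_max = np.maximum(np.abs(x).max(axis=0), eps)  # over tokens
    w_max = np.maximum(np.abs(w).max(axis=0), eps)    # over output channels
    return act_max**alpha / w_max ** (1 - alpha)


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))  # (tokens, in_features)
x[:, 3] *= 50.0               # synthetic outlier channel
w = rng.normal(size=(4, 8))   # (out_features, in_features)

s = smoothquant_scales(x, w, alpha=0.8)
x_smooth = x / s  # in practice folded into the preceding layer
w_smooth = w * s  # folded into the weights offline

print("activation max before:", np.abs(x).max())
print("activation max after: ", np.abs(x_smooth).max())
print("output unchanged:", np.allclose(x_smooth @ w_smooth.T, x @ w.T))
```

The `mappings` argument in the recipe identifies which preceding layer (for example `input_layernorm`) absorbs the division for each group of linear layers.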
 


## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (version 1), MMLU-Pro, GPQA, HumanEval, and MBPP, as well as the MMMU and ChartQA vision benchmarks.
Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), whereas coding tasks were evaluated with a fork of [evalplus](https://github.com/neuralmagic/evalplus).
[vLLM](https://docs.vllm.ai/en/stable/) was used as the engine in all cases.

<details>
  <summary>Evaluation details</summary>

  **MMLU**
  ```
  lm_eval \
    --model vllm \
    --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
    --tasks mmlu \
    --num_fewshot 5 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto
  ```

  **ARC Challenge**
  ```
  lm_eval \
    --model vllm \
    --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto
  ```

  **GSM8k**
  ```
  lm_eval \
    --model vllm \
    --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
    --tasks gsm8k \
    --num_fewshot 8 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto
  ```

  **Hellaswag**
  ```
  lm_eval \
    --model vllm \
    --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
    --tasks hellaswag \
    --num_fewshot 10 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto
  ```

  **Winogrande**
  ```
  lm_eval \
    --model vllm \
    --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
    --tasks winogrande \
    --num_fewshot 5 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto
  ```

  **TruthfulQA**
  ```
  lm_eval \
    --model vllm \
    --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
    --tasks truthfulqa \
    --num_fewshot 0 \
    --apply_chat_template \
    --batch_size auto
  ```

  **MMLU-pro**
  ```
  lm_eval \
    --model vllm \
    --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
    --tasks mmlu_pro \
    --num_fewshot 5 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto
  ```

  **MMMU**
  ```
  lm_eval \
    --model vllm \
    --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_images=8,enable_chunk_prefill=True,tensor_parallel_size=2 \
    --tasks mmmu \
    --apply_chat_template \
    --batch_size auto
  ```

  **ChartQA**
  ```
  lm_eval \
    --model vllm \
    --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_images=8,enable_chunk_prefill=True,tensor_parallel_size=2 \
    --tasks chartqa \
    --apply_chat_template \
    --batch_size auto
  ```

**Coding**

The commands below target HumanEval; to run MBPP, replace the dataset name with `mbpp`.

*Generation*
```
python3 codegen/generate.py \
  --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*
```
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2
```

*Evaluation*
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2-sanitized
```
</details>
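
The coding commands generate 50 samples per problem at temperature 0.2 and report pass@1. As a hedged illustration, the sketch below shows the unbiased pass@k estimator commonly used with HumanEval-style benchmarks; whether the evalplus fork computes its scores in exactly this way is an assumption. For k = 1 the estimator reduces to the fraction of samples that pass.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem counts of correct samples out of n = 50.
correct_counts = [50, 25, 0, 40]
per_problem = [pass_at_k(50, c, 1) for c in correct_counts]
benchmark_score = 100 * sum(per_problem) / len(per_problem)
print(f"pass@1: {benchmark_score:.2f}")
```

The benchmark-level score is the mean of the per-problem estimates, expressed here as a percentage to match the table.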

### Accuracy

<table>
  <tr>
   <th>Category
   </th>
   <th>Benchmark
   </th>
   <th>Mistral-Small-3.1-24B-Instruct-2503
   </th>
   <th>Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8<br>(this model)
   </th>
   <th>Recovery
   </th>
  </tr>
  <tr>
   <td rowspan="7" ><strong>OpenLLM v1</strong>
   </td>
   <td>MMLU (5-shot)
   </td>
   <td>80.67
   </td>
   <td>80.40
   </td>
   <td>99.7%
   </td>
  </tr>
  <tr>
   <td>ARC Challenge (25-shot)
   </td>
   <td>72.78
   </td>
   <td>73.46
   </td>
   <td>100.9%
   </td>
  </tr>
  <tr>
   <td>GSM-8K (5-shot, strict-match)
   </td>
   <td>65.35
   </td>
   <td>70.58
   </td>
   <td>108.0%
   </td>
  </tr>
  <tr>
   <td>Hellaswag (10-shot)
   </td>
   <td>83.70
   </td>
   <td>82.26
   </td>
   <td>98.3%
   </td>
  </tr>
  <tr>
   <td>Winogrande (5-shot)
   </td>
   <td>83.74
   </td>
   <td>80.90
   </td>
   <td>96.6%
   </td>
  </tr>
  <tr>
   <td>TruthfulQA (0-shot, mc2)
   </td>
   <td>70.62
   </td>
   <td>69.15
   </td>
   <td>97.9%
   </td>
  </tr>
  <tr>
   <td><strong>Average</strong>
   </td>
   <td><strong>76.14</strong>
   </td>
   <td><strong>76.13</strong>
   </td>
   <td><strong>100.0%</strong>
   </td>
  </tr>
  <tr>
   <td rowspan="3" ><strong></strong>
   </td>
   <td>MMLU-Pro (5-shot)
   </td>
   <td>67.25
   </td>
   <td>66.54
   </td>
   <td>98.9%
   </td>
  </tr>
  <tr>
   <td>GPQA CoT main (5-shot)
   </td>
   <td>42.63
   </td>
   <td>44.64
   </td>
   <td>104.7%
   </td>
  </tr>
  <tr>
   <td>GPQA CoT diamond (5-shot)
   </td>
   <td>45.96
   </td>
   <td>41.92
   </td>
   <td>91.2%
   </td>
  </tr>
  <tr>
   <td rowspan="4" ><strong>Coding</strong>
   </td>
   <td>HumanEval pass@1
   </td>
   <td>84.70
   </td>
   <td>84.20
   </td>
   <td>99.4%
   </td>
  </tr>
  <tr>
   <td>HumanEval+ pass@1
   </td>
   <td>79.50
   </td>
   <td>81.00
   </td>
   <td>101.9%
   </td>
  </tr>
  <tr>
   <td>MBPP pass@1
   </td>
   <td>71.10
   </td>
   <td>72.10
   </td>
   <td>101.4%
   </td>
  </tr>
  <tr>
   <td>MBPP+ pass@1
   </td>
   <td>60.60
   </td>
   <td>62.10
   </td>
   <td>102.5%
   </td>
  </tr>
  <tr>
   <td rowspan="2" ><strong>Vision</strong>
   </td>
   <td>MMMU (0-shot)
   </td>
   <td>52.11
   </td>
   <td>53.11
   </td>
   <td>101.9%
   </td>
  </tr>
  <tr>
   <td>ChartQA (0-shot)
   </td>
   <td>81.36
   </td>
   <td>82.36
   </td>
   <td>101.2%
   </td>
  </tr>
</table>
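
The Recovery column is consistent with the quantized score expressed as a percentage of the baseline score. The short sketch below reproduces a few rows under that assumption, using values copied from the table.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


# (baseline, quantized) pairs copied from the table above.
scores = {
    "MMLU (5-shot)": (80.67, 80.40),
    "GSM-8K (strict-match)": (65.35, 70.58),
    "GPQA CoT diamond (5-shot)": (45.96, 41.92),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
```

Values above 100% mean the quantized model scored higher than the baseline on that benchmark, which can happen through run-to-run variation as well as genuine differences.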