---
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
library_name: vllm
base_model:
- mistralai/Mistral-Small-3.1-24B-Instruct-2503
pipeline_tag: image-text-to-text
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- int8
---

# Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8

## Model Overview
- **Model Architecture:** Mistral3ForConditionalGeneration
  - **Input:** Text / Image
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** INT8
  - **Weight quantization:** INT8
- **Intended Use Cases:** It is ideal for:
  - Fast-response conversational agents.
  - Low-latency function calling.
  - Subject matter experts via fine-tuning.
  - Local inference for hobbyists and organizations handling sensitive data.
  - Programming and math reasoning.
  - Long document understanding.
  - Visual understanding.
- **Out-of-scope:** This model is not specifically designed or evaluated for all downstream purposes, thus:
  1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.
  2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model's focus on English.
  3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
- **Release Date:** 04/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the activations and weights of [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) to the INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](https://arxiv.org/abs/2210.17323) algorithms is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
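As a minimal sketch of the OpenAI-compatible route, the snippet below assumes the server was started locally on the default port with `vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8`; the host, port, and placeholder API key are illustrative and should be adjusted for your deployment.

```python
# Query a locally running vLLM OpenAI-compatible server (assumed started with:
#   vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8)
from openai import OpenAI

# vLLM does not require a real API key by default; any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```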
## Creation
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
import io

from datasets import load_dataset, interleave_datasets
from PIL import Image
from transformers import AutoProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableMistral3ForConditionalGeneration

# Load model
model_stub = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
model_name = model_stub.split("/")[-1]

num_text_samples = 1024
num_vision_samples = 1024
max_seq_len = 8192

processor = AutoProcessor.from_pretrained(model_stub)

model = TraceableMistral3ForConditionalGeneration.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Text-only data subset
def preprocess_text(example):
    input = {
        "text": processor.apply_chat_template(
            example["messages"],
            add_generation_prompt=False,
        ),
        "images": None,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dst = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_text_samples))
dst = dst.map(preprocess_text, remove_columns=dst.column_names)

# Text + vision data subset
def preprocess_vision(example):
    messages = []
    image = None
    for message in example["messages"]:
        message_content = []
        for content in message["content"]:
            if content["type"] == "text":
                message_content.append({"type": "text", "text": content["text"]})
            else:
                message_content.append({"type": "image"})
                image = Image.open(io.BytesIO(content["image"]))
        messages.append(
            {
                "role": message["role"],
                "content": message_content,
            }
        )
    input = {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
        ),
        "images": image,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dsv = load_dataset("neuralmagic/calibration", name="VLLM", split="train").select(range(num_vision_samples))
dsv = dsv.map(preprocess_vision, remove_columns=dsv.column_names)

# Interleave subsets
ds = interleave_datasets([dsv, dst])

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        ignore=["language_model.lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
        sequential_targets=["MistralDecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Define data collator
def data_collator(batch):
    import torch
    assert len(batch) == 1
    collated = {}
    for k, v in batch[0].items():
        if v is None:
            continue
        if k == "input_ids":
            collated[k] = torch.LongTensor(v)
        elif k == "pixel_values":
            collated[k] = torch.tensor(v, dtype=torch.bfloat16)
        else:
            collated[k] = torch.tensor(v)
    return collated

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    data_collator=data_collator,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print(f"Model and processor saved to: {save_path}")
```
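The `scheme="W8A8"` entry in the recipe corresponds to the symmetric static per-channel INT8 weight scheme described under Model Optimizations. The sketch below is a conceptual illustration only, not llm-compressor's actual implementation (which additionally applies SmoothQuant and GPTQ): each output channel of a weight matrix gets its own scale so that its largest-magnitude value maps to 127.

```python
import torch

def quantize_per_channel_symmetric(weight: torch.Tensor):
    """Illustrative sketch of symmetric per-channel INT8 quantization of a 2-D weight.

    One scale per output channel (row); values are rounded and clamped to [-127, 127].
    This mirrors the W8A8 weight scheme conceptually, not llm-compressor's exact algorithm.
    """
    max_abs = weight.abs().amax(dim=1, keepdim=True)   # per-row max magnitude
    scales = max_abs.clamp(min=1e-8) / 127.0            # one scale per output channel
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

# Quantize a random weight and check the per-channel rounding error
w = torch.randn(4, 8)
q, s = quantize_per_channel_symmetric(w)
w_hat = q.to(torch.float32) * s
print((w - w_hat).abs().max())  # small reconstruction error
```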
## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (version 1), MMLU-Pro, GPQA, HumanEval, and MBPP. Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), whereas coding tasks were evaluated with a fork of [evalplus](https://github.com/neuralmagic/evalplus). [vLLM](https://docs.vllm.ai/en/stable/) was used as the engine in all cases.
### Evaluation details

**MMLU**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**ARC Challenge**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**GSM8k**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Hellaswag**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Winogrande**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks winogrande \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**TruthfulQA**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --apply_chat_template \
  --batch_size auto
```

**MMLU-Pro**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu_pro \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Coding**

The commands below can be used for MBPP by simply replacing the dataset name.

*Generation*
```
python3 codegen/generate.py \
  --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*
```
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2
```

*Evaluation*
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2-sanitized
```
### Accuracy

#### Open LLM Leaderboard evaluation scores
| Category | Benchmark | Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8<br>(this model) | Recovery |
|----------|-----------|-------------------------------------|--------------------------------------------------------------------|----------|
| **OpenLLM v1** | MMLU (5-shot) | 80.67 | 80.40 | 99.7% |
| | ARC Challenge (25-shot) | 72.78 | 73.46 | 100.9% |
| | GSM-8K (5-shot, strict-match) | 65.35 | 70.58 | 108.0% |
| | Hellaswag (10-shot) | 83.70 | 82.26 | 98.3% |
| | Winogrande (5-shot) | 83.74 | 80.90 | 96.6% |
| | TruthfulQA (0-shot, mc2) | 70.62 | 69.15 | 97.9% |
| | **Average** | **76.14** | **76.13** | **100.0%** |
| | MMLU-Pro (5-shot) | 67.25 | 66.54 | 98.9% |
| | GPQA CoT main (5-shot) | 42.63 | 44.64 | 104.7% |
| | GPQA CoT diamond (5-shot) | 45.96 | 41.92 | 91.2% |
| **Coding** | HumanEval pass@1 | | 84.70 | |
| | HumanEval+ pass@1 | | 79.50 | |
| | MBPP pass@1 | | 71.10 | |
| | MBPP+ pass@1 | | 60.60 | |
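Recovery is the quantized model's score divided by the unquantized model's score, expressed as a percentage. A minimal sketch of the computation, using the MMLU row from the table above as an example:

```python
# Recovery = quantized score / reference score, as a percentage.
# Example values taken from the MMLU (5-shot) row above.
reference_score = 80.67   # Mistral-Small-3.1-24B-Instruct-2503
quantized_score = 80.40   # this model
recovery = 100.0 * quantized_score / reference_score
print(f"Recovery: {recovery:.1f}%")  # -> Recovery: 99.7%
```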