---
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
library_name: vllm
base_model:
- mistralai/Mistral-Small-3.1-24B-Instruct-2503
pipeline_tag: image-text-to-text
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- int8
---

# Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8

## Model Overview
- **Model Architecture:** Mistral3ForConditionalGeneration
  - **Input:** Text / Image
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** INT8
  - **Weight quantization:** INT8
- **Intended Use Cases:** It is ideal for:
  - Fast-response conversational agents.
  - Low-latency function calling.
  - Subject matter experts via fine-tuning.
  - Local inference for hobbyists and organizations handling sensitive data.
  - Programming and math reasoning.
  - Long document understanding.
  - Visual understanding.
- **Out-of-scope:** This model is not specifically designed or evaluated for all downstream purposes, thus:
  1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.
  2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model's focus on English.
  3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
- **Release Date:** 04/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the activations and weights of [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) to the INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](https://arxiv.org/abs/2210.17323) algorithms is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
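As a minimal sketch of the OpenAI-compatible route, the snippet below assumes the server was started locally on the default port with `vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8`; the host, port, and placeholder API key are illustrative and should be adjusted for your deployment.

```python
# Query a locally running vLLM OpenAI-compatible server (assumed started with:
#   vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8)
from openai import OpenAI

# vLLM does not require a real API key by default; any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```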
## Creation
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
import io

from datasets import load_dataset, interleave_datasets
from PIL import Image
from transformers import AutoProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableMistral3ForConditionalGeneration

# Load model
model_stub = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
model_name = model_stub.split("/")[-1]

num_text_samples = 1024
num_vision_samples = 1024
max_seq_len = 8192

processor = AutoProcessor.from_pretrained(model_stub)

model = TraceableMistral3ForConditionalGeneration.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Text-only data subset
def preprocess_text(example):
    input = {
        "text": processor.apply_chat_template(
            example["messages"],
            add_generation_prompt=False,
        ),
        "images": None,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dst = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_text_samples))
dst = dst.map(preprocess_text, remove_columns=dst.column_names)

# Text + vision data subset
def preprocess_vision(example):
    messages = []
    image = None
    for message in example["messages"]:
        message_content = []
        for content in message["content"]:
            if content["type"] == "text":
                message_content.append({"type": "text", "text": content["text"]})
            else:
                message_content.append({"type": "image"})
                image = Image.open(io.BytesIO(content["image"]))
        messages.append(
            {
                "role": message["role"],
                "content": message_content,
            }
        )
    input = {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
        ),
        "images": image,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dsv = load_dataset("neuralmagic/calibration", name="VLLM", split="train").select(range(num_vision_samples))
dsv = dsv.map(preprocess_vision, remove_columns=dsv.column_names)

# Interleave subsets
ds = interleave_datasets([dsv, dst])

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        ignore=["language_model.lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
        sequential_targets=["MistralDecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Define data collator
def data_collator(batch):
    import torch
    assert len(batch) == 1
    collated = {}
    for k, v in batch[0].items():
        if v is None:
            continue
        if k == "input_ids":
            collated[k] = torch.LongTensor(v)
        elif k == "pixel_values":
            collated[k] = torch.tensor(v, dtype=torch.bfloat16)
        else:
            collated[k] = torch.tensor(v)
    return collated

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    data_collator=data_collator,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print(f"Model and processor saved to: {save_path}")
```
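The `scheme="W8A8"` entry in the recipe corresponds to the symmetric static per-channel INT8 weight scheme described under Model Optimizations. The sketch below is a conceptual illustration only, not llm-compressor's actual implementation (which additionally applies SmoothQuant and GPTQ): each output channel of a weight matrix gets its own scale so that its largest-magnitude value maps to 127.

```python
import torch

def quantize_per_channel_symmetric(weight: torch.Tensor):
    """Illustrative sketch of symmetric per-channel INT8 quantization of a 2-D weight.

    One scale per output channel (row); values are rounded and clamped to [-127, 127].
    This mirrors the W8A8 weight scheme conceptually, not llm-compressor's exact algorithm.
    """
    max_abs = weight.abs().amax(dim=1, keepdim=True)   # per-row max magnitude
    scales = max_abs.clamp(min=1e-8) / 127.0            # one scale per output channel
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

# Quantize a random weight and check the per-channel rounding error
w = torch.randn(4, 8)
q, s = quantize_per_channel_symmetric(w)
w_hat = q.to(torch.float32) * s
print((w - w_hat).abs().max())  # small reconstruction error
```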
## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (version 1), MMLU-Pro, GPQA, HumanEval, and MBPP. Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), whereas coding tasks were evaluated with a fork of [evalplus](https://github.com/neuralmagic/evalplus). [vLLM](https://docs.vllm.ai/en/stable/) was used as the engine in all cases.
### Evaluation details

**MMLU**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**ARC Challenge**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**GSM8k**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Hellaswag**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Winogrande**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks winogrande \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**TruthfulQA**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --apply_chat_template \
  --batch_size auto
```

**MMLU-Pro**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu_pro \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Coding**

The commands below can be used for MBPP by simply replacing the dataset name.

*Generation*
```
python3 codegen/generate.py \
  --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*
```
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2
```

*Evaluation*
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2-sanitized
```
### Accuracy

#### Open LLM Leaderboard evaluation scores
| Category | Benchmark | Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8<br>(this model) | Recovery |
|----------|-----------|-------------------------------------|--------------------------------------------------------------------|----------|
| **OpenLLM v1** | MMLU (5-shot) | 80.67 | 80.40 | 99.7% |
| | ARC Challenge (25-shot) | 72.78 | 73.46 | 100.9% |
| | GSM-8K (5-shot, strict-match) | 65.35 | 70.58 | 108.0% |
| | Hellaswag (10-shot) | 83.70 | 82.26 | 98.3% |
| | Winogrande (5-shot) | 83.74 | 80.90 | 96.6% |
| | TruthfulQA (0-shot, mc2) | 70.62 | 69.15 | 97.9% |
| | **Average** | **76.14** | **76.13** | **100.0%** |
| | MMLU-Pro (5-shot) | 67.25 | 66.54 | 98.9% |
| | GPQA CoT main (5-shot) | 42.63 | 44.64 | 104.7% |
| | GPQA CoT diamond (5-shot) | 45.96 | 41.92 | 91.2% |
| **Coding** | HumanEval pass@1 | | 84.70 | |
| | HumanEval+ pass@1 | | 79.50 | |
| | MBPP pass@1 | | 71.10 | |
| | MBPP+ pass@1 | | 60.60 | |
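Recovery is the quantized model's score divided by the unquantized model's score, expressed as a percentage. A minimal sketch of the computation, using the MMLU row from the table above as an example:

```python
# Recovery = quantized score / reference score, as a percentage.
# Example values taken from the MMLU (5-shot) row above.
reference_score = 80.67   # Mistral-Small-3.1-24B-Instruct-2503
quantized_score = 80.40   # this model
recovery = 100.0 * quantized_score / reference_score
print(f"Recovery: {recovery:.1f}%")  # -> Recovery: 99.7%
```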