---
license: apache-2.0
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
tags:
- quantized
- gguf
- mistral
- instruct
- llama.cpp
- ollama
- vision
- multimodal
- multilingual
model_type: mistral
inference: false
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
pipeline_tag: text-generation
---
See our collection for all of our models.
# Mistral-Small-3.1-24B-Instruct - GGUF
**High-quality GGUF quantizations of Mistral-Small-3.1-24B-Instruct-2503**
## Model Description
This repository contains **GGUF quantized versions** of the [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) model, optimized for efficient inference using [llama.cpp](https://github.com/ggerganov/llama.cpp), [Ollama](https://ollama.com/), and other GGUF-compatible frameworks.
**Mistral Small 3.1** builds upon Mistral Small 3 (2501) and adds **state-of-the-art vision understanding** and enhances **long context capabilities up to 128k tokens** without compromising text performance. With **24 billion parameters**, this model achieves top-tier capabilities in both text and vision tasks.
### Key Features
- **Vision Capabilities**: Analyze images and provide insights based on visual content
- **Multilingual**: Supports 24+ languages including English, French, German, Spanish, Japanese, Chinese, Arabic, and more
- **Agent-Centric**: Best-in-class agentic capabilities with native function calling and JSON output
- **Advanced Reasoning**: State-of-the-art conversational and reasoning capabilities
- **Long Context**: 128k token context window for processing large documents
- **Apache 2.0 License**: Open license for commercial and non-commercial use
- **System Prompt Support**: Strong adherence to system prompts
## Quick Start
### Using with Ollama
```bash
# Download and run the model
ollama run hf.co/your-username/mistral-small-3.1-24b-instruct-gguf:q4_k_m
# Or create from local file
ollama create mistral-small-local -f Modelfile
ollama run mistral-small-local
```
**Modelfile for Ollama:**
```dockerfile
FROM ./mistral-small-3.1-24b-instruct-q4_k_m.gguf
TEMPLATE """[SYSTEM_PROMPT]{{ .System }}[/SYSTEM_PROMPT][INST]{{ .Prompt }}[/INST]"""
PARAMETER temperature 0.15
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 128000
SYSTEM """You are Mistral Small 3.1, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris. You are knowledgeable, creative, and provide detailed responses while being concise when appropriate. You have vision capabilities and can analyze images when provided."""
```
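Once created, the model can also be queried programmatically through Ollama's local REST API (served on port 11434 by default). A minimal sketch, assuming the `mistral-small-local` model from the Modelfile above:

```python
import json
import urllib.request

# Chat request against the local Ollama server (default: http://localhost:11434).
# Assumes the model was created with `ollama create mistral-small-local -f Modelfile`.
payload = {
    "model": "mistral-small-local",
    "messages": [
        {"role": "user", "content": "Summarize the Apache 2.0 license in two sentences."}
    ],
    "stream": False,  # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["message"]["content"])
```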
### Using with llama.cpp
```bash
# Download the model
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models
# Run inference
./llama-cli -m ./models/mistral-small-3.1-24b-instruct-q4_k_m.gguf -p "[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Hello! How are you?[/INST]" -n 256 -c 128000
```
### Using with Python (llama-cpp-python)
```python
from llama_cpp import Llama
# Load the model
llm = Llama(
    model_path="./mistral-small-3.1-24b-instruct-q4_k_m.gguf",
    n_ctx=128000,      # Full 128k context window
    n_threads=8,       # Number of CPU threads
    n_gpu_layers=35,   # Number of layers to offload to GPU (if available)
    verbose=False,
)
# Generate response with proper template
prompt = "[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Explain quantum computing in simple terms[/INST]"
response = llm(
    prompt,
    max_tokens=512,
    temperature=0.15,
    top_p=0.9,
)
print(response["choices"][0]["text"])
```
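If you would rather not assemble the prompt string by hand, llama-cpp-python also exposes an OpenAI-style chat API; on recent versions it applies the chat template stored in the GGUF metadata when one is present. A minimal sketch, reusing the `llm` instance created above:

```python
# Higher-level chat API: the conversation is formatted for you, typically using
# the chat template embedded in the GGUF metadata.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    max_tokens=512,
    temperature=0.15,
    top_p=0.9,
)
print(response["choices"][0]["message"]["content"])
```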
## Quantization Variants
| Variant | File Size | Description | Use Case | Quality Loss |
|---------|-----------|-------------|----------|--------------|
| **F16** | 44.0 GB | Original precision | Maximum quality, research | None |
| **Q8_0** | 23.3 GB | 8-bit quantization | High-end inference | Minimal |
| **Q6_K** | 18.0 GB | 6-bit K-quantization | Production quality | Very Low |
| **Q5_K_M** | 15.6 GB | 5-bit K-quant (medium) | **Recommended balance** | Low |
| **Q5_K_S** | 15.2 GB | 5-bit K-quant (small) | Balanced quality/size | Low |
| **Q5_1** | 16.5 GB | 5-bit legacy | Legacy compatibility | Low |
| **Q5_0** | 15.2 GB | 5-bit legacy | Legacy compatibility | Low |
| **Q4_K_M** | 13.4 GB | 4-bit K-quant (medium) | **Popular choice** | Moderate |
| **Q4_K_S** | 12.5 GB | 4-bit K-quant (small) | Resource constrained | Moderate |
| **Q4_1** | 13.9 GB | 4-bit legacy | Legacy compatibility | Moderate |
| **Q4_0** | 12.5 GB | 4-bit legacy | Legacy compatibility | Moderate |
| **Q3_K_L** | 11.5 GB | 3-bit K-quant (large) | Limited resources | Noticeable |
| **Q3_K_M** | 10.8 GB | 3-bit K-quant (medium) | Limited resources | Noticeable |
| **Q3_K_S** | 9.7 GB | 3-bit K-quant (small) | Very limited resources | Noticeable |
| **Q2_K** | 8.3 GB | 2-bit K-quantization | Extreme compression | Significant |
### Recommended Variants
- **Q5_K_M** (15.6 GB): Best balance of quality and size for most users
- **Q4_K_M** (13.4 GB): Good quality with smaller size, popular choice
- **Q6_K** (18.0 GB): Near-original quality if you have the resources
- **Q3_K_M** (10.8 GB): Minimum viable quality for resource-constrained environments
## Model Details
### Architecture
- **Model Type**: Mistral Small 3.1
- **Parameters**: 24 billion
- **Context Length**: 128,000 tokens (128k)
- **Vocabulary Size**: 131,000 (Tekken tokenizer)
- **Architecture**: Transformer with sliding window attention
- **Precision**: Various GGUF quantizations
- **Base Model**: [Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)
### Capabilities
- **Vision Understanding**: State-of-the-art multimodal capabilities for image analysis
- **Instruction Following**: Excellent at following complex instructions
- **Code Generation**: Strong programming capabilities across multiple languages
- **Mathematical Reasoning**: Advanced math and logical reasoning (69.30% on the MATH benchmark)
- **Multilingual**: Native support for 24+ languages
- **Conversation**: Natural dialogue and chat capabilities
- **Function Calling**: Native tool calling and JSON output capabilities
- **Long Context**: Process documents up to 128k tokens
### Benchmark Performance
#### Text Benchmarks
- **MMLU**: 80.62% (general knowledge)
- **MATH**: 69.30% (mathematical reasoning)
- **HumanEval**: 88.41% (code generation)
- **GPQA**: 44.42% (graduate-level questions)
#### Vision Benchmarks
- **MMMU**: 64.00% (multimodal understanding)
- **ChartQA**: 86.24% (chart analysis)
- **DocVQA**: 94.08% (document visual Q&A)
- **AI2D**: 93.72% (scientific diagrams)
#### Long Context
- **RULER 32K**: 93.96%
- **RULER 128K**: 81.20%
- **LongBench v2**: 37.18%
## Chat Template
This model uses the **Mistral V7-Tekken instruction format**:
```
<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]
```
**Examples:**
**Basic Chat:**
```
[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Write a Python function to calculate the factorial of a number[/INST]
```
**With Vision:**
```
[SYSTEM_PROMPT]You are a helpful AI assistant with vision capabilities.[/SYSTEM_PROMPT][INST]What do you see in this image? [/INST]
```
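For tooling that expects a raw prompt string, the conversation can be flattened into this format with a small helper. The function below is a hypothetical sketch (not part of any library); it omits the leading `<s>` BOS token, which the tokenizer normally adds on its own:

```python
def build_v7_tekken_prompt(system_prompt, turns):
    """Flatten a conversation into the Mistral V7-Tekken prompt format.

    `turns` is a list of (user_message, assistant_reply) pairs; use None as the
    reply of the final pair to request a new completion.
    """
    prompt = f"[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT]"
    for user_message, assistant_reply in turns:
        prompt += f"[INST]{user_message}[/INST]"
        if assistant_reply is not None:
            prompt += f"{assistant_reply}</s>"
    return prompt

# Reproduces the "Basic Chat" example above.
print(build_v7_tekken_prompt(
    "You are a helpful AI assistant.",
    [("Write a Python function to calculate the factorial of a number", None)],
))
```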
## Technical Requirements
### Minimum System Requirements
| Variant | RAM | VRAM (GPU) | Storage |
|---------|-----|------------|---------|
| Q2_K | 16 GB | 8 GB | 10 GB |
| Q3_K_M | 24 GB | 12 GB | 12 GB |
| Q4_K_M | 32 GB | 16 GB | 15 GB |
| Q5_K_M | 48 GB | 18 GB | 17 GB |
| Q6_K+ | 64 GB | 20+ GB | 20+ GB |
### Recommended Hardware
- **CPU**: Modern multi-core processor (12+ cores recommended for 128k context)
- **RAM**: 64+ GB for optimal performance with long contexts
- **GPU**: RTX 3090/4090 (24GB), RTX 6000 Ada (48GB), or A100 for GPU acceleration
- **Storage**: NVMe SSD for faster model loading
**Note**: The original model requires ~55GB GPU RAM in bf16/fp16. Quantized versions significantly reduce memory requirements.
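As a rough rule of thumb, the GGUF file must fit in memory with several extra GB left over for the KV cache, which grows with the context length you request. A minimal sketch for sanity-checking this before downloading, using the approximate file sizes from the table above (`psutil` is a third-party package; the 6 GB overhead figure is an assumption, not a measurement):

```python
import psutil

# Approximate file sizes in GB, taken from the quantization table above.
VARIANT_SIZES_GB = {
    "Q6_K": 18.0,
    "Q5_K_M": 15.6,
    "Q4_K_M": 13.4,
    "Q3_K_M": 10.8,
    "Q2_K": 8.3,
}

# Rough allowance for KV cache and runtime buffers; grows with the context size.
KV_CACHE_OVERHEAD_GB = 6.0

total_ram_gb = psutil.virtual_memory().total / 1024**3
print(f"Detected {total_ram_gb:.1f} GB of system RAM")
for name, size_gb in VARIANT_SIZES_GB.items():
    fits = size_gb + KV_CACHE_OVERHEAD_GB <= total_ram_gb
    print(f"{name:>7}: {'should fit' if fits else 'likely too large'} for CPU inference")
```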
## Download Instructions
### Individual Files
```bash
# Download specific quantization
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models
# Download all files (warning: ~240 GB total)
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf --local-dir ./models
```
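The same download can be scripted with the `huggingface_hub` Python library, which resumes interrupted transfers automatically. A minimal sketch; the repository id is a placeholder matching the CLI examples above:

```python
from huggingface_hub import hf_hub_download

# Download a single quantization into ./models (repo id is a placeholder).
path = hf_hub_download(
    repo_id="your-username/mistral-small-3.1-24b-instruct-gguf",
    filename="mistral-small-3.1-24b-instruct-q4_k_m.gguf",
    local_dir="./models",
)
print(f"Model saved to {path}")
```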
### Git LFS
```bash
git clone https://huggingface.co/your-username/mistral-small-3.1-24b-instruct-gguf
cd mistral-small-3.1-24b-instruct-gguf
git lfs pull
```
## Usage Examples
### Code Generation
```
[SYSTEM_PROMPT]You are an expert programmer.[/SYSTEM_PROMPT][INST]Create a REST API using FastAPI for a todo application with CRUD operations[/INST]
```
### Creative Writing
```
[SYSTEM_PROMPT]You are a creative writing assistant.[/SYSTEM_PROMPT][INST]Write a short story about a time traveler who accidentally changes a small detail in the past[/INST]
```
### Data Analysis Help
```
[SYSTEM_PROMPT]You are a data science expert.[/SYSTEM_PROMPT][INST]I have a dataset with missing values. Explain different strategies to handle them and provide Python code examples[/INST]
```
### Multilingual Support
```
[SYSTEM_PROMPT]Tu es un assistant multilingue.[/SYSTEM_PROMPT][INST]Explique-moi la différence entre l'apprentissage supervisé et non supervisé[/INST]
```
### Function Calling
```python
# The model supports native function calling for tool use
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
            },
        },
    }
]
```
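A tool schema like this can be passed to llama-cpp-python's OpenAI-style chat API (reusing the `llm` instance from the Quick Start section). Whether the model returns a structured tool call depends on the chat template and handler in use, so treat this as a sketch rather than a guaranteed recipe:

```python
# Sketch: hand the tool schema to the chat API and inspect the result.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
message = response["choices"][0]["message"]
if message.get("tool_calls"):
    # The model asked to call a tool; run it and feed the result back in a follow-up turn.
    for call in message["tool_calls"]:
        print("Tool requested:", call["function"]["name"], call["function"]["arguments"])
else:
    # No tool call was produced; fall back to the plain text reply.
    print(message["content"])
```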
## Ideal Use Cases
- **Fast-response conversational agents**
- **Low-latency function calling**
- **Subject matter experts via fine-tuning**
- **Local inference for privacy-sensitive applications**
- **Programming and mathematical reasoning**
- **Long document understanding and analysis**
- **Visual content analysis and description**
- **Multilingual applications**
## Limitations
- **Quantization Loss**: Lower bit quantizations (Q2, Q3) may show reduced quality, especially for complex reasoning
- **Context Limit**: Maximum context length of 128,000 tokens
- **Knowledge Cutoff**: Training data cutoff as of October 2023
- **Hallucination**: May generate plausible but incorrect information
- **Bias**: May reflect biases present in training data
- **Vision**: Text-only GGUF quantizations do not carry the vision encoder; image input typically requires a separate multimodal projector (mmproj) file and a runtime that supports it
## Ethical Considerations
- Use responsibly and in accordance with Mistral AI's usage policies
- Be aware of potential biases in model outputs
- Verify important information from model responses
- Consider privacy implications when processing sensitive data
- Follow applicable laws and regulations in your jurisdiction
- Respect copyright when analyzing images or documents
## License
This model is released under the **Apache 2.0 License**, same as the original Mistral-Small-3.1-24B-Instruct-2503 model.
## Acknowledgments
- **Mistral AI** for the original Mistral-Small-3.1-24B-Instruct-2503 model
- **Georgi Gerganov** and the llama.cpp team for GGUF format and quantization tools
- **The open-source community** for continued development of efficient inference tools
## Support
- **Issues**: Report issues with these GGUF files in this repository
- **Original Model**: For questions about the base model, refer to [Mistral AI's repository](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
- **llama.cpp**: For technical issues with inference, check the [llama.cpp repository](https://github.com/ggerganov/llama.cpp)
- **Ollama**: For Ollama-specific issues, see [Ollama documentation](https://ollama.com/)
---
**Made with ❤️ by the open-source community**
[Hugging Face](https://huggingface.co/) • [llama.cpp](https://github.com/ggerganov/llama.cpp) • [Mistral AI](https://mistral.ai/) • [Ollama](https://ollama.com/)