---
license: apache-2.0
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
tags:
- quantized
- gguf
- mistral
- instruct
- llama.cpp
- ollama
- vision
- multimodal
- multilingual
model_type: mistral
inference: false
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
pipeline_tag: text-generation
---
<p style="margin-bottom: 0;">
<em>See <a href="https://huggingface.co/muranAI">our collection</a> for all of our new models.</em>
</p>
<div style="display: flex; gap: 5px; align-items: center; ">
<a href="https://muranai.com/">
<img src="https://muranai.com/images/logo_white.png" width="133">
</a>
</div>
# Mistral-Small-3.1-24B-Instruct - GGUF
<div align="center">
**High-quality GGUF quantizations of Mistral-Small-3.1-24B-Instruct-2503**
</div>
## Model Description
This repository contains **GGUF quantized versions** of the [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) model, optimized for efficient inference using [llama.cpp](https://github.com/ggerganov/llama.cpp), [Ollama](https://ollama.com/), and other GGUF-compatible frameworks.
**Mistral Small 3.1** builds upon Mistral Small 3 (2501) and adds **state-of-the-art vision understanding** and enhances **long context capabilities up to 128k tokens** without compromising text performance. With **24 billion parameters**, this model achieves top-tier capabilities in both text and vision tasks.
### Key Features
- **Vision Capabilities**: Analyze images and provide insights based on visual content
- **Multilingual**: Supports 24+ languages including English, French, German, Spanish, Japanese, Chinese, Arabic, and more
- **Agent-Centric**: Best-in-class agentic capabilities with native function calling and JSON output
- **Advanced Reasoning**: State-of-the-art conversational and reasoning capabilities
- **Long Context**: 128k token context window for processing large documents
- **Apache 2.0 License**: Open license for commercial and non-commercial use
- **System Prompt Support**: Strong adherence to system prompts
## Quick Start
### Using with Ollama
```bash
# Download and run the model
ollama run hf.co/your-username/mistral-small-3.1-24b-instruct-gguf:q4_k_m
# Or create from local file
ollama create mistral-small-local -f Modelfile
ollama run mistral-small-local
```
**Modelfile for Ollama:**
```dockerfile
FROM ./mistral-small-3.1-24b-instruct-q4_k_m.gguf
TEMPLATE """<s>[SYSTEM_PROMPT]{{ .System }}[/SYSTEM_PROMPT][INST]{{ .Prompt }}[/INST]"""
PARAMETER temperature 0.15
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 128000
SYSTEM """You are Mistral Small 3.1, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris. You are knowledgeable, creative, and provide detailed responses while being concise when appropriate. You have vision capabilities and can analyze images when provided."""
```
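Once created, the model is also reachable through Ollama's local REST API (default port 11434). A minimal Python sketch using `requests`, assuming the `mistral-small-local` model from the `ollama create` command above:
```python
import requests

# Chat with the locally created Ollama model over its REST API
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral-small-local",
        "messages": [
            {"role": "user", "content": "Summarize what GGUF quantization is in two sentences."},
        ],
        "stream": False,  # return a single JSON object instead of streamed chunks
    },
    timeout=300,
)
print(response.json()["message"]["content"])
```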
### Using with llama.cpp
```bash
# Download the model
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models
# Run inference
./llama-cli -m ./models/mistral-small-3.1-24b-instruct-q4_k_m.gguf -p "<s>[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Hello! How are you?[/INST]" -n 256 -c 128000
```
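llama.cpp also includes `llama-server`, which exposes an OpenAI-compatible HTTP endpoint (default port 8080). A minimal Python sketch, assuming a server was started with `./llama-server -m ./models/mistral-small-3.1-24b-instruct-q4_k_m.gguf`:
```python
import requests

# Query a running llama-server instance via its OpenAI-compatible chat endpoint
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "Hello! How are you?"},
        ],
        "max_tokens": 256,
        "temperature": 0.15,
    },
    timeout=300,
)
print(response.json()["choices"][0]["message"]["content"])
```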
### Using with Python (llama-cpp-python)
```python
from llama_cpp import Llama
# Load the model
llm = Llama(
model_path="./mistral-small-3.1-24b-instruct-q4_k_m.gguf",
n_ctx=128000, # Full 128k context window
n_threads=8, # Number of CPU threads
n_gpu_layers=35, # Number of layers to offload to GPU (if available)
verbose=False
)
# Generate response with proper template
prompt = "<s>[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Explain quantum computing in simple terms[/INST]"
response = llm(
prompt,
max_tokens=512,
temperature=0.15,
top_p=0.9,
)
print(response["choices"][0]["text"])
```
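For longer generations you may prefer to stream tokens as they are produced. A minimal sketch reusing the `llm` object from above (llama-cpp-python supports `stream=True` on the same call):
```python
# Stream the completion token by token instead of waiting for the full response
for chunk in llm(
    prompt,
    max_tokens=512,
    temperature=0.15,
    top_p=0.9,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```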
## Quantization Variants
| Variant | File Size | Description | Use Case | Quality Loss |
|---------|-----------|-------------|----------|--------------|
| **F16** | 44.0 GB | Original precision | Maximum quality, research | None |
| **Q8_0** | 23.3 GB | 8-bit quantization | High-end inference | Minimal |
| **Q6_K** | 18.0 GB | 6-bit K-quantization | Production quality | Very Low |
| **Q5_K_M** | 15.6 GB | 5-bit K-quant (medium) | **Recommended balance** | Low |
| **Q5_K_S** | 15.2 GB | 5-bit K-quant (small) | Balanced quality/size | Low |
| **Q5_1** | 16.5 GB | 5-bit legacy | Legacy compatibility | Low |
| **Q5_0** | 15.2 GB | 5-bit legacy | Legacy compatibility | Low |
| **Q4_K_M** | 13.4 GB | 4-bit K-quant (medium) | **Popular choice** | Moderate |
| **Q4_K_S** | 12.5 GB | 4-bit K-quant (small) | Resource constrained | Moderate |
| **Q4_1** | 13.9 GB | 4-bit legacy | Legacy compatibility | Moderate |
| **Q4_0** | 12.5 GB | 4-bit legacy | Legacy compatibility | Moderate |
| **Q3_K_L** | 11.5 GB | 3-bit K-quant (large) | Limited resources | Noticeable |
| **Q3_K_M** | 10.8 GB | 3-bit K-quant (medium) | Limited resources | Noticeable |
| **Q3_K_S** | 9.7 GB | 3-bit K-quant (small) | Very limited resources | Noticeable |
| **Q2_K** | 8.3 GB | 2-bit K-quantization | Extreme compression | Significant |
### Recommended Variants
- **Q5_K_M** (15.6 GB): Best balance of quality and size for most users
- **Q4_K_M** (13.4 GB): Good quality with smaller size, popular choice
- **Q6_K** (18.0 GB): Near-original quality if you have the resources
- **Q3_K_M** (10.8 GB): Minimum viable quality for resource-constrained environments
## Model Details
### Architecture
- **Model Type**: Mistral Small 3.1
- **Parameters**: 24 billion
- **Context Length**: 128,000 tokens (128k)
- **Vocabulary Size**: 131,000 (Tekken tokenizer)
- **Architecture**: Transformer with sliding window attention
- **Precision**: Various GGUF quantizations
- **Base Model**: [Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)
### Capabilities
- **Vision Understanding**: State-of-the-art multimodal capabilities for image analysis
- **Instruction Following**: Excellent at following complex instructions
- **Code Generation**: Strong programming capabilities across multiple languages
- **Mathematical Reasoning**: Advanced math and logical reasoning (69.30% on MATH benchmark)
- **Multilingual**: Native support for 24+ languages
- **Conversation**: Natural dialogue and chat capabilities
- **Function Calling**: Native tool calling and JSON output capabilities
- **Long Context**: Process documents up to 128k tokens
### Benchmark Performance
#### Text Benchmarks
- **MMLU**: 80.62% (general knowledge)
- **MATH**: 69.30% (mathematical reasoning)
- **HumanEval**: 88.41% (code generation)
- **GPQA**: 44.42% (graduate-level questions)
#### Vision Benchmarks
- **MMMU**: 64.00% (multimodal understanding)
- **ChartQA**: 86.24% (chart analysis)
- **DocVQA**: 94.08% (document visual Q&A)
- **AI2D**: 93.72% (scientific diagrams)
#### Long Context
- **RULER 32K**: 93.96%
- **RULER 128K**: 81.20%
- **LongBench v2**: 37.18%
## Chat Template
This model uses the **Mistral V7-Tekken instruction format**:
```
<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]
```
**Examples:**
**Basic Chat:**
```
<s>[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Write a Python function to calculate the factorial of a number[/INST]
```
**With Vision:**
```
<s>[SYSTEM_PROMPT]You are a helpful AI assistant with vision capabilities.[/SYSTEM_PROMPT][INST]What do you see in this image? <image>[/INST]
```
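For multi-turn conversations, the format simply alternates `[INST]...[/INST]` user blocks with assistant replies terminated by `</s>`. A small Python sketch of assembling such a prompt (the `build_prompt` helper is ours, not part of any library):
```python
def build_prompt(system_prompt, turns):
    """Assemble a Mistral V7-Tekken prompt.

    `turns` is a list of (user_message, assistant_reply) pairs; leave the
    last reply as None so the model generates it.
    """
    prompt = f"<s>[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT]"
    for user_message, assistant_reply in turns:
        prompt += f"[INST]{user_message}[/INST]"
        if assistant_reply is not None:
            prompt += f"{assistant_reply}</s>"
    return prompt

# Two-turn example: the model should answer the second question
prompt = build_prompt(
    "You are a helpful AI assistant.",
    [
        ("What is GGUF?", "GGUF is a binary file format for quantized LLM weights."),
        ("How is it different from the older GGML format?", None),
    ],
)
```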
## Technical Requirements
### Minimum System Requirements
| Variant | RAM | VRAM (GPU) | Storage |
|---------|-----|------------|---------|
| Q2_K | 16 GB | 8 GB | 10 GB |
| Q3_K_M | 24 GB | 12 GB | 12 GB |
| Q4_K_M | 32 GB | 16 GB | 15 GB |
| Q5_K_M | 48 GB | 18 GB | 17 GB |
| Q6_K+ | 64 GB | 20+ GB | 20+ GB |
### Recommended Hardware
- **CPU**: Modern multi-core processor (12+ cores recommended for 128k context)
- **RAM**: 64+ GB for optimal performance with long contexts
- **GPU**: RTX 3090/4090 (24GB), RTX 6000 Ada (48GB), or A100 for GPU acceleration
- **Storage**: NVMe SSD for faster model loading
**Note**: The original model requires ~55GB GPU RAM in bf16/fp16. Quantized versions significantly reduce memory requirements.
## Download Instructions
### Individual Files
```bash
# Download specific quantization
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models
# Download all files (warning: ~240 GB total across all quantizations)
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf --local-dir ./models
```
### Git LFS
```bash
git clone https://huggingface.co/your-username/mistral-small-3.1-24b-instruct-gguf
cd mistral-small-3.1-24b-instruct-gguf
git lfs pull
```
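The same download can be scripted from Python with `huggingface_hub` (the repository and file names are the same placeholders used above):
```python
from huggingface_hub import hf_hub_download

# Fetch a single quantization into ./models
model_path = hf_hub_download(
    repo_id="your-username/mistral-small-3.1-24b-instruct-gguf",
    filename="mistral-small-3.1-24b-instruct-q4_k_m.gguf",
    local_dir="./models",
)
print(model_path)
```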
## Usage Examples
### Code Generation
```
<s>[SYSTEM_PROMPT]You are an expert programmer.[/SYSTEM_PROMPT][INST]Create a REST API using FastAPI for a todo application with CRUD operations[/INST]
```
### Creative Writing
```
<s>[SYSTEM_PROMPT]You are a creative writing assistant.[/SYSTEM_PROMPT][INST]Write a short story about a time traveler who accidentally changes a small detail in the past[/INST]
```
### Data Analysis Help
```
<s>[SYSTEM_PROMPT]You are a data science expert.[/SYSTEM_PROMPT][INST]I have a dataset with missing values. Explain different strategies to handle them and provide Python code examples[/INST]
```
### Multilingual Support
```
<s>[SYSTEM_PROMPT]Tu es un assistant multilingue.[/SYSTEM_PROMPT][INST]Explique-moi la différence entre l'apprentissage supervisé et non supervisé[/INST]
```
### Function Calling
```python
# The model supports native function calling for tool use
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
}
}
}
}
]
```
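How tool calls are surfaced depends on your runtime. As one hedged illustration, recent llama-cpp-python versions accept a `tools` list in `create_chat_completion`; support varies by version and chat format, so treat this as a sketch rather than a guaranteed interface:
```python
from llama_cpp import Llama

llm = Llama(model_path="./mistral-small-3.1-24b-instruct-q4_k_m.gguf", n_ctx=32768)

# Ask a question that should trigger the get_weather tool defined above
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant with tool access."},
        {"role": "user", "content": "What is the weather in Paris right now?"},
    ],
    tools=tools,
    tool_choice="auto",
)

# If the model chose to call a tool, the call appears on the returned message
message = response["choices"][0]["message"]
print(message.get("tool_calls") or message.get("content"))
```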
## Ideal Use Cases
- **Fast-response conversational agents**
- **Low-latency function calling**
- **Subject matter experts via fine-tuning**
- **Local inference for privacy-sensitive applications**
- **Programming and mathematical reasoning**
- **Long document understanding and analysis**
- **Visual content analysis and description**
- **Multilingual applications**
## Limitations
- **Quantization Loss**: Lower bit quantizations (Q2, Q3) may show reduced quality, especially for complex reasoning
- **Context Limit**: Maximum context length of 128,000 tokens
- **Knowledge Cutoff**: Training data cutoff as of October 2023
- **Hallucination**: May generate plausible but incorrect information
- **Bias**: May reflect biases present in training data
- **Vision**: These GGUF files quantize the language model; vision quality may be degraded or unavailable depending on whether your runtime supports the multimodal components
## Ethical Considerations
- Use responsibly and in accordance with Mistral AI's usage policies
- Be aware of potential biases in model outputs
- Verify important information from model responses
- Consider privacy implications when processing sensitive data
- Follow applicable laws and regulations in your jurisdiction
- Respect copyright when analyzing images or documents
## License
This model is released under the **Apache 2.0 License**, same as the original Mistral-Small-3.1-24B-Instruct-2503 model.
## Acknowledgments
- **Mistral AI** for the original Mistral-Small-3.1-24B-Instruct-2503 model
- **Georgi Gerganov** and the llama.cpp team for GGUF format and quantization tools
- **The open-source community** for continued development of efficient inference tools
## Support
- **Issues**: Report issues with these GGUF files in this repository
- **Original Model**: For questions about the base model, refer to [Mistral AI's repository](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
- **llama.cpp**: For technical issues with inference, check the [llama.cpp repository](https://github.com/ggerganov/llama.cpp)
- **Ollama**: For Ollama-specific issues, see [Ollama documentation](https://ollama.com/)
---
<div align="center">
**Made with ❤️ by the open-source community**
[Hugging Face](https://huggingface.co/) • [llama.cpp](https://github.com/ggerganov/llama.cpp) • [Mistral AI](https://mistral.ai/) • [Ollama](https://ollama.com/)
</div>