---
license: apache-2.0
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
tags:
- quantized
- gguf
- mistral
- instruct
- llama.cpp
- ollama
- vision
- multimodal
- multilingual
model_type: mistral
inference: false
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
pipeline_tag: text-generation
---

See our collection for all of our new models.

# Mistral-Small-3.1-24B-Instruct - GGUF
**High-quality GGUF quantizations of Mistral-Small-3.1-24B-Instruct-2503**

[![Quantization: GGUF](https://img.shields.io/badge/Quantization-GGUF-blue)](#quantization-variants) [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green)](#license) [![Base Model: Mistral Small 3.1 24B](https://img.shields.io/badge/Base%20Model-Mistral%20Small%203.1%2024B-orange)](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) [![Context: 128k tokens](https://img.shields.io/badge/Context-128k%20tokens-purple)](#model-details)
## Model Description

This repository contains **GGUF quantized versions** of the [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) model, optimized for efficient inference with [llama.cpp](https://github.com/ggerganov/llama.cpp), [Ollama](https://ollama.com/), and other GGUF-compatible frameworks.

**Mistral Small 3.1** builds upon Mistral Small 3 (2501), adding **state-of-the-art vision understanding** and enhancing **long context capabilities up to 128k tokens** without compromising text performance. With **24 billion parameters**, the model achieves top-tier capabilities in both text and vision tasks.

### Key Features

- **Vision Capabilities**: Analyze images and provide insights based on visual content
- **Multilingual**: Supports 24+ languages including English, French, German, Spanish, Japanese, Chinese, Arabic, and more
- **Agent-Centric**: Best-in-class agentic capabilities with native function calling and JSON output
- **Advanced Reasoning**: State-of-the-art conversational and reasoning capabilities
- **Long Context**: 128k token context window for processing large documents
- **Apache 2.0 License**: Open license for commercial and non-commercial use
- **System Prompt Support**: Strong adherence to system prompts

## Quick Start

### Using with Ollama

```bash
# Download and run the model
ollama run hf.co/your-username/mistral-small-3.1-24b-instruct-gguf:q4_k_m

# Or create from local file
ollama create mistral-small-local -f Modelfile
ollama run mistral-small-local
```

**Modelfile for Ollama:**

```dockerfile
FROM ./mistral-small-3.1-24b-instruct-q4_k_m.gguf
TEMPLATE """[SYSTEM_PROMPT]{{ .System }}[/SYSTEM_PROMPT][INST]{{ .Prompt }}[/INST]"""
PARAMETER temperature 0.15
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 128000
SYSTEM """You are Mistral Small 3.1, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris. You are knowledgeable, creative, and provide detailed responses while being concise when appropriate. You have vision capabilities and can analyze images when provided."""
```
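If you prefer to drive the locally registered model from Python, the official `ollama` client package can be used. This is a minimal sketch, assuming the Modelfile above has been registered as `mistral-small-local` and that `pip install ollama` has been run:

```python
# Requires a running Ollama server and the `ollama` Python package.
import ollama

response = ollama.chat(
    model="mistral-small-local",  # name created with `ollama create` above
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Summarize the GGUF format in two sentences."},
    ],
    options={"temperature": 0.15},
)
print(response["message"]["content"])
```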
### Using with llama.cpp

```bash
# Download the model
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models

# Run inference
./llama-cli -m ./models/mistral-small-3.1-24b-instruct-q4_k_m.gguf -p "[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Hello! How are you?[/INST]" -n 256 -c 128000
```

### Using with Python (llama-cpp-python)

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./mistral-small-3.1-24b-instruct-q4_k_m.gguf",
    n_ctx=128000,      # Full 128k context window
    n_threads=8,       # Number of CPU threads
    n_gpu_layers=35,   # Number of layers to offload to GPU (if available)
    verbose=False
)

# Generate response with proper template
prompt = "[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Explain quantum computing in simple terms[/INST]"
response = llm(
    prompt,
    max_tokens=512,
    temperature=0.15,
    top_p=0.9,
)
print(response["choices"][0]["text"])
```
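Instead of hand-assembling the prompt string, `llama-cpp-python` can also take an OpenAI-style message list and stream tokens as they are generated. A minimal sketch continuing from the `llm` object above, assuming the chat template embedded in the GGUF metadata is picked up (otherwise fall back to the raw prompt format shown above):

```python
# Reuses the `llm` object created above; messages are converted to the
# model's chat template (read from the GGUF metadata when available).
stream = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms"},
    ],
    max_tokens=512,
    temperature=0.15,
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries an OpenAI-style delta; print content as it arrives.
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
print()
```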
## Quantization Variants

| Variant | File Size | Description | Use Case | Quality Loss |
|---------|-----------|-------------|----------|--------------|
| **F16** | 44.0 GB | Original precision | Maximum quality, research | None |
| **Q8_0** | 23.3 GB | 8-bit quantization | High-end inference | Minimal |
| **Q6_K** | 18.0 GB | 6-bit K-quantization | Production quality | Very Low |
| **Q5_K_M** | 15.6 GB | 5-bit K-quant (medium) | **Recommended balance** | Low |
| **Q5_K_S** | 15.2 GB | 5-bit K-quant (small) | Balanced quality/size | Low |
| **Q5_1** | 16.5 GB | 5-bit legacy | Legacy compatibility | Low |
| **Q5_0** | 15.2 GB | 5-bit legacy | Legacy compatibility | Low |
| **Q4_K_M** | 13.4 GB | 4-bit K-quant (medium) | **Popular choice** | Moderate |
| **Q4_K_S** | 12.5 GB | 4-bit K-quant (small) | Resource constrained | Moderate |
| **Q4_1** | 13.9 GB | 4-bit legacy | Legacy compatibility | Moderate |
| **Q4_0** | 12.5 GB | 4-bit legacy | Legacy compatibility | Moderate |
| **Q3_K_L** | 11.5 GB | 3-bit K-quant (large) | Limited resources | Noticeable |
| **Q3_K_M** | 10.8 GB | 3-bit K-quant (medium) | Limited resources | Noticeable |
| **Q3_K_S** | 9.7 GB | 3-bit K-quant (small) | Very limited resources | Noticeable |
| **Q2_K** | 8.3 GB | 2-bit K-quantization | Extreme compression | Significant |

### Recommended Variants

- **Q5_K_M** (15.6 GB): Best balance of quality and size for most users
- **Q4_K_M** (13.4 GB): Good quality at a smaller size, a popular choice
- **Q6_K** (18.0 GB): Near-original quality if you have the resources
- **Q3_K_M** (10.8 GB): Minimum viable quality for resource-constrained environments

## Model Details

### Architecture

- **Model Type**: Mistral Small 3.1
- **Parameters**: 24 billion
- **Context Length**: 128,000 tokens (128k)
- **Vocabulary Size**: 131,000 (Tekken tokenizer)
- **Architecture**: Transformer with sliding window attention
- **Precision**: Various GGUF quantizations
- **Base Model**: [Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)

### Capabilities

- **Vision Understanding**: State-of-the-art multimodal capabilities for image analysis
- **Instruction Following**: Excellent at following complex instructions
- **Code Generation**: Strong programming capabilities across multiple languages
- **Mathematical Reasoning**: Advanced math and logical reasoning (69.30% on the MATH benchmark)
- **Multilingual**: Native support for 24+ languages
- **Conversation**: Natural dialogue and chat capabilities
- **Function Calling**: Native tool calling and JSON output capabilities
- **Long Context**: Process documents up to 128k tokens

### Benchmark Performance

#### Text Benchmarks

- **MMLU**: 80.62% (general knowledge)
- **MATH**: 69.30% (mathematical reasoning)
- **HumanEval**: 88.41% (code generation)
- **GPQA**: 44.42% (graduate-level questions)

#### Vision Benchmarks

- **MMMU**: 64.00% (multimodal understanding)
- **ChartQA**: 86.24% (chart analysis)
- **DocVQA**: 94.08% (document visual Q&A)
- **AI2D**: 93.72% (scientific diagrams)

#### Long Context

- **RULER 32K**: 93.96%
- **RULER 128K**: 81.20%
- **LongBench v2**: 37.18%

## Chat Template

This model uses the **Mistral V7-Tekken instruction format**:

```
<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]
```

**Examples:**

**Basic Chat:**

```
[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Write a Python function to calculate the factorial of a number[/INST]
```

**With Vision:**

```
[SYSTEM_PROMPT]You are a helpful AI assistant with vision capabilities.[/SYSTEM_PROMPT][INST]What do you see in this image? [/INST]
```

## Technical Requirements

### Minimum System Requirements

| Variant | RAM | VRAM (GPU) | Storage |
|---------|-----|------------|---------|
| Q2_K | 16 GB | 8 GB | 10 GB |
| Q3_K_M | 24 GB | 12 GB | 12 GB |
| Q4_K_M | 32 GB | 16 GB | 15 GB |
| Q5_K_M | 48 GB | 18 GB | 17 GB |
| Q6_K+ | 64 GB | 20+ GB | 20+ GB |

### Recommended Hardware

- **CPU**: Modern multi-core processor (12+ cores recommended for 128k context)
- **RAM**: 64+ GB for optimal performance with long contexts
- **GPU**: RTX 3090/4090 (24 GB), RTX 6000 Ada (48 GB), or A100 for GPU acceleration
- **Storage**: NVMe SSD for faster model loading

**Note**: The original model requires ~55 GB of GPU memory in bf16/fp16. Quantized versions significantly reduce memory requirements.

## Download Instructions

### Individual Files

```bash
# Download a specific quantization
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models

# Download all files (warning: ~240 GB total)
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf --local-dir ./models
```

### Git LFS

```bash
git clone https://huggingface.co/your-username/mistral-small-3.1-24b-instruct-gguf
cd mistral-small-3.1-24b-instruct-gguf
git lfs pull
```
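Downloads can also be scripted from Python via `huggingface_hub`. A minimal sketch using the same placeholder repository id and filename as the commands above:

```python
from huggingface_hub import hf_hub_download

# Fetch a single quantization into ./models (placeholder repo id, as above).
path = hf_hub_download(
    repo_id="your-username/mistral-small-3.1-24b-instruct-gguf",
    filename="mistral-small-3.1-24b-instruct-q4_k_m.gguf",
    local_dir="./models",
)
print(f"Model downloaded to {path}")
```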
## Usage Examples

### Code Generation

```
[SYSTEM_PROMPT]You are an expert programmer.[/SYSTEM_PROMPT][INST]Create a REST API using FastAPI for a todo application with CRUD operations[/INST]
```

### Creative Writing

```
[SYSTEM_PROMPT]You are a creative writing assistant.[/SYSTEM_PROMPT][INST]Write a short story about a time traveler who accidentally changes a small detail in the past[/INST]
```

### Data Analysis Help

```
[SYSTEM_PROMPT]You are a data science expert.[/SYSTEM_PROMPT][INST]I have a dataset with missing values. Explain different strategies to handle them and provide Python code examples[/INST]
```

### Multilingual Support

```
[SYSTEM_PROMPT]Tu es un assistant multilingue.[/SYSTEM_PROMPT][INST]Explique-moi la différence entre l'apprentissage supervisé et non supervisé[/INST]
```

### Function Calling

The model supports native function calling; tools are described with JSON schemas, for example:

```python
# The model supports native function calling for tool use
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
            },
        },
    }
]
```
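A hedged sketch of passing the tool list above through `llama-cpp-python`'s `create_chat_completion`, reusing the `llm` object from the Quick Start. Whether the model returns a structured `tool_calls` entry depends on the chat template/handler in use, so treat this as illustrative rather than a guaranteed interface:

```python
import json

# Assumes `llm` from the Quick Start and the `tools` list defined above.
# Tool-call behavior varies with the chat format/handler in llama-cpp-python.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.15,
)

message = response["choices"][0]["message"]
if message.get("tool_calls"):
    call = message["tool_calls"][0]
    print("Tool requested:", call["function"]["name"])
    print("Arguments:", json.loads(call["function"]["arguments"]))
else:
    # No tool call was emitted; fall back to the plain text answer.
    print(message["content"])
```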
## Ideal Use Cases

- **Fast-response conversational agents**
- **Low-latency function calling**
- **Subject-matter experts via fine-tuning**
- **Local inference for privacy-sensitive applications**
- **Programming and mathematical reasoning**
- **Long document understanding and analysis**
- **Visual content analysis and description**
- **Multilingual applications**

## Limitations

- **Quantization Loss**: Lower-bit quantizations (Q2, Q3) may show reduced quality, especially for complex reasoning
- **Context Limit**: Maximum context length of 128,000 tokens
- **Knowledge Cutoff**: Training data cutoff of October 2023
- **Hallucination**: May generate plausible but incorrect information
- **Bias**: May reflect biases present in training data
- **Vision**: These text-only GGUF quantizations do not fully preserve the model's vision capabilities

## Ethical Considerations

- Use responsibly and in accordance with Mistral AI's usage policies
- Be aware of potential biases in model outputs
- Verify important information from model responses
- Consider privacy implications when processing sensitive data
- Follow applicable laws and regulations in your jurisdiction
- Respect copyright when analyzing images or documents

## License

This model is released under the **Apache 2.0 License**, the same license as the original Mistral-Small-3.1-24B-Instruct-2503 model.

## Acknowledgments

- **Mistral AI** for the original Mistral-Small-3.1-24B-Instruct-2503 model
- **Georgi Gerganov** and the llama.cpp team for the GGUF format and quantization tools
- **The open-source community** for continued development of efficient inference tools

## Support

- **Issues**: Report issues with these GGUF files in this repository
- **Original Model**: For questions about the base model, refer to [Mistral AI's repository](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
- **llama.cpp**: For technical issues with inference, check the [llama.cpp repository](https://github.com/ggerganov/llama.cpp)
- **Ollama**: For Ollama-specific issues, see the [Ollama documentation](https://ollama.com/)

---

**Made by the open-source community**

[Hugging Face](https://huggingface.co/) • [llama.cpp](https://github.com/ggerganov/llama.cpp) • [Mistral AI](https://mistral.ai/) • [Ollama](https://ollama.com/)