See our collection for all new models.

Mistral-Small-3.1-24B-Instruct - GGUF

High-quality GGUF quantizations of Mistral-Small-3.1-24B-Instruct-2503

📖 Model Description

This repository contains GGUF quantized versions of the Mistral-Small-3.1-24B-Instruct-2503 model, optimized for efficient inference using llama.cpp, Ollama, and other GGUF-compatible frameworks.

Mistral Small 3.1 builds upon Mistral Small 3 (2501), adding state-of-the-art vision understanding and extending long-context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

Key Features ✨

  • 🖼️ Vision Capabilities: Analyze images and provide insights based on visual content
  • 🌍 Multilingual: Supports 24+ languages including English, French, German, Spanish, Japanese, Chinese, Arabic, and more
  • 🤖 Agent-Centric: Best-in-class agentic capabilities with native function calling and JSON output
  • 🧠 Advanced Reasoning: State-of-the-art conversational and reasoning capabilities
  • 📏 Long Context: 128k token context window for processing large documents
  • ⚖️ Apache 2.0 License: Open license for commercial and non-commercial use
  • 🎯 System Prompt Support: Strong adherence to system prompts

🚀 Quick Start

Using with Ollama

# Download and run the model
ollama run hf.co/your-username/mistral-small-3.1-24b-instruct-gguf:q4_k_m

# Or create from local file
ollama create mistral-small-local -f Modelfile
ollama run mistral-small-local

Modelfile for Ollama:

FROM ./mistral-small-3.1-24b-instruct-q4_k_m.gguf

TEMPLATE """<s>[SYSTEM_PROMPT]{{ .System }}[/SYSTEM_PROMPT][INST]{{ .Prompt }}[/INST]"""

PARAMETER temperature 0.15
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 128000

SYSTEM """You are Mistral Small 3.1, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris. You are knowledgeable, creative, and provide detailed responses while being concise when appropriate. You have vision capabilities and can analyze images when provided."""

Using with llama.cpp

# Download the model
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models

# Run inference
./llama-cli -m ./models/mistral-small-3.1-24b-instruct-q4_k_m.gguf -p "<s>[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Hello! How are you?[/INST]" -n 256 -c 128000
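llama.cpp also ships an OpenAI-compatible HTTP server. As a rough sketch, after starting it with something like ./llama-server -m ./models/mistral-small-3.1-24b-instruct-q4_k_m.gguf -c 32768 --port 8080 (adjust flags to your build and hardware), you can query it from Python using the standard chat-completions payload:

import requests

# Query a running llama-server instance via its OpenAI-compatible endpoint
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "mistral-small-3.1-24b-instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "Hello! How are you?"},
        ],
        "max_tokens": 256,
        "temperature": 0.15,
    },
)
print(resp.json()["choices"][0]["message"]["content"])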

Using with Python (llama-cpp-python)

from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./mistral-small-3.1-24b-instruct-q4_k_m.gguf",
    n_ctx=128000,  # Full 128k context window
    n_threads=8,   # Number of CPU threads
    n_gpu_layers=35,  # Number of layers to offload to GPU (if available)
    verbose=False
)

# Generate response with proper template
prompt = "<s>[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Explain quantum computing in simple terms[/INST]"

response = llm(
    prompt,
    max_tokens=512,
    temperature=0.15,
    top_p=0.9,
)

print(response["choices"][0]["text"])

📊 Quantization Variants

Variant | File Size | Description | Use Case | Quality Loss
F16 | 44.0 GB | Original precision | Maximum quality, research | None
Q8_0 | 23.3 GB | 8-bit quantization | High-end inference | Minimal
Q6_K | 18.0 GB | 6-bit K-quantization | Production quality | Very Low
Q5_K_M | 15.6 GB | 5-bit K-quant (medium) | Recommended balance | Low
Q5_K_S | 15.2 GB | 5-bit K-quant (small) | Balanced quality/size | Low
Q5_1 | 16.5 GB | 5-bit legacy | Legacy compatibility | Low
Q5_0 | 15.2 GB | 5-bit legacy | Legacy compatibility | Low
Q4_K_M | 13.4 GB | 4-bit K-quant (medium) | Popular choice | Moderate
Q4_K_S | 12.5 GB | 4-bit K-quant (small) | Resource constrained | Moderate
Q4_1 | 13.9 GB | 4-bit legacy | Legacy compatibility | Moderate
Q4_0 | 12.5 GB | 4-bit legacy | Legacy compatibility | Moderate
Q3_K_L | 11.5 GB | 3-bit K-quant (large) | Limited resources | Noticeable
Q3_K_M | 10.8 GB | 3-bit K-quant (medium) | Limited resources | Noticeable
Q3_K_S | 9.7 GB | 3-bit K-quant (small) | Very limited resources | Noticeable
Q2_K | 8.3 GB | 2-bit K-quantization | Extreme compression | Significant

🎯 Recommended Variants

  • Q5_K_M (15.6 GB): Best balance of quality and size for most users
  • Q4_K_M (13.4 GB): Good quality with smaller size, popular choice
  • Q6_K (18.0 GB): Near-original quality if you have the resources
  • Q3_K_M (10.8 GB): Minimum viable quality for resource-constrained environments

🛠️ Model Details

Architecture

  • Model Type: Mistral Small 3.1
  • Parameters: 24 billion
  • Context Length: 128,000 tokens (128k)
  • Vocabulary Size: 131,000 (Tekken tokenizer)
  • Architecture: Transformer with sliding window attention
  • Precision: Various GGUF quantizations
  • Base Model: Mistral-Small-3.1-24B-Base-2503

Capabilities

  • 🖼️ Vision Understanding: State-of-the-art multimodal capabilities for image analysis
  • 📝 Instruction Following: Excellent at following complex instructions
  • 💻 Code Generation: Strong programming capabilities across multiple languages
  • 🧮 Mathematical Reasoning: Advanced math and logical reasoning (69.30% on MATH benchmark)
  • 🌍 Multilingual: Native support for 24+ languages
  • 💬 Conversation: Natural dialogue and chat capabilities
  • 🔧 Function Calling: Native tool calling and JSON output capabilities
  • 📚 Long Context: Process documents up to 128k tokens

Benchmark Performance

Text Benchmarks

  • MMLU: 80.62% (general knowledge)
  • MATH: 69.30% (mathematical reasoning)
  • HumanEval: 88.41% (code generation)
  • GPQA: 44.42% (graduate-level questions)

Vision Benchmarks

  • MMMU: 64.00% (multimodal understanding)
  • ChartQA: 86.24% (chart analysis)
  • DocVQA: 94.08% (document visual Q&A)
  • AI2D: 93.72% (scientific diagrams)

Long Context

  • RULER 32K: 93.96%
  • RULER 128K: 81.20%
  • LongBench v2: 37.18%

💬 Chat Template

This model uses the Mistral V7-Tekken instruction format:

<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]

Examples:

Basic Chat:

<s>[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Write a Python function to calculate the factorial of a number[/INST]

With Vision:

<s>[SYSTEM_PROMPT]You are a helpful AI assistant with vision capabilities.[/SYSTEM_PROMPT][INST]What do you see in this image? <image>[/INST]
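For multi-turn use outside of a chat-template-aware runtime, the format above can be assembled programmatically. A small illustrative helper (plain string assembly following the template shown, not an official API):

def build_prompt(system_prompt, turns):
    """Assemble a Mistral V7-Tekken prompt from a system prompt and a list of
    (user, assistant) turns; the final assistant reply may be None when pending."""
    prompt = f"<s>[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT]"
    for user_msg, assistant_msg in turns:
        prompt += f"[INST]{user_msg}[/INST]"
        if assistant_msg is not None:
            prompt += f"{assistant_msg}</s>"
    return prompt

# Example: one completed turn, then a new user question awaiting a reply
print(build_prompt(
    "You are a helpful AI assistant.",
    [("What is GGUF?", "GGUF is a binary format for storing quantized LLM weights."),
     ("Which variant should I download?", None)],
))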

🔧 Technical Requirements

Minimum System Requirements

Variant | RAM | VRAM (GPU) | Storage
Q2_K | 16 GB | 8 GB | 10 GB
Q3_K_M | 24 GB | 12 GB | 12 GB
Q4_K_M | 32 GB | 16 GB | 15 GB
Q5_K_M | 48 GB | 18 GB | 17 GB
Q6_K+ | 64 GB | 20+ GB | 20+ GB

Recommended Hardware

  • CPU: Modern multi-core processor (12+ cores recommended for 128k context)
  • RAM: 64+ GB for optimal performance with long contexts
  • GPU: RTX 3090/4090 (24GB), RTX 6000 Ada (48GB), or A100 for GPU acceleration
  • Storage: NVMe SSD for faster model loading

Note: The original model requires ~55GB GPU RAM in bf16/fp16. Quantized versions significantly reduce memory requirements.
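As a back-of-the-envelope check before choosing a variant, total memory is roughly the GGUF file size plus the KV cache, which grows linearly with context length. A rough sketch of the standard estimate (the layer/head numbers below are placeholders to be read from the GGUF metadata, not values confirmed here):

def estimate_memory_gb(file_size_gb, n_layers, n_kv_heads, head_dim,
                       n_ctx, bytes_per_elem=2):
    """Rough total memory estimate: model weights + KV cache.
    KV cache = 2 (K and V) * layers * context * kv_heads * head_dim * bytes."""
    kv_cache_gb = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1024**3
    return file_size_gb + kv_cache_gb

# Example: Q4_K_M file (13.4 GB) at 32k context, with placeholder architecture values
print(estimate_memory_gb(13.4, n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=32768))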

📥 Download Instructions

Individual Files

# Download specific quantization
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models

# Download all files (warning: ~240 GB total across all quantizations)
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf --local-dir ./models
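The same download can be done from Python with huggingface_hub; a minimal sketch using the placeholder repo and file names from the examples above:

from huggingface_hub import hf_hub_download

# Download a single quantization into ./models
path = hf_hub_download(
    repo_id="your-username/mistral-small-3.1-24b-instruct-gguf",
    filename="mistral-small-3.1-24b-instruct-q4_k_m.gguf",
    local_dir="./models",
)
print(path)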

Git LFS

git clone https://huggingface.co/your-username/mistral-small-3.1-24b-instruct-gguf
cd mistral-small-3.1-24b-instruct-gguf
git lfs pull

🧪 Usage Examples

Code Generation

<s>[SYSTEM_PROMPT]You are an expert programmer.[/SYSTEM_PROMPT][INST]Create a REST API using FastAPI for a todo application with CRUD operations[/INST]

Creative Writing

<s>[SYSTEM_PROMPT]You are a creative writing assistant.[/SYSTEM_PROMPT][INST]Write a short story about a time traveler who accidentally changes a small detail in the past[/INST]

Data Analysis Help

<s>[SYSTEM_PROMPT]You are a data science expert.[/SYSTEM_PROMPT][INST]I have a dataset with missing values. Explain different strategies to handle them and provide Python code examples[/INST]

Multilingual Support

<s>[SYSTEM_PROMPT]Tu es un assistant multilingue.[/SYSTEM_PROMPT][INST]Explique-moi la différence entre l'apprentissage supervisé et non supervisé[/INST]

Function Calling

# The model supports native function calling for tool use
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                }
            }
        }
    }
]
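With llama-cpp-python, this tools list can be passed to create_chat_completion (reusing the llm object from the Quick Start). Whether tool calls are actually emitted depends on the chat handler/template in use, so treat this as a sketch rather than a guaranteed path:

# Sketch: ask the model to use the get_weather tool defined above
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.15,
)
# The reply may include tool_calls if the model decided to invoke the tool
print(result["choices"][0]["message"])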

🏆 Ideal Use Cases

  • 💬 Fast-response conversational agents
  • ⚡ Low-latency function calling
  • 🎓 Subject matter experts via fine-tuning
  • 🏠 Local inference for privacy-sensitive applications
  • 💻 Programming and mathematical reasoning
  • 📄 Long document understanding and analysis
  • 🖼️ Visual content analysis and description
  • 🌍 Multilingual applications

⚠️ Limitations

  • Quantization Loss: Lower bit quantizations (Q2, Q3) may show reduced quality, especially for complex reasoning
  • Context Limit: Maximum context length of 128,000 tokens
  • Knowledge Cutoff: Training data cutoff as of October 2023
  • Hallucination: May generate plausible but incorrect information
  • Bias: May reflect biases present in training data
  • Vision: Text-only quantizations don't preserve vision capabilities optimally

🛡️ Ethical Considerations

  • Use responsibly and in accordance with Mistral AI's usage policies
  • Be aware of potential biases in model outputs
  • Verify important information from model responses
  • Consider privacy implications when processing sensitive data
  • Follow applicable laws and regulations in your jurisdiction
  • Respect copyright when analyzing images or documents

📄 License

This model is released under the Apache 2.0 License, same as the original Mistral-Small-3.1-24B-Instruct-2503 model.

🙏 Acknowledgments

  • Mistral AI for the original Mistral-Small-3.1-24B-Instruct-2503 model
  • Georgi Gerganov and the llama.cpp team for GGUF format and quantization tools
  • The open-source community for continued development of efficient inference tools

📞 Support


Made with ❤️ by the open-source community

🤗 Hugging Face · 🦙 llama.cpp · 🧠 Mistral AI · 📱 Ollama
