See our collection for all new models.

Mistral-Small-3.1-24B-Instruct - GGUF

High-quality GGUF quantizations of Mistral-Small-3.1-24B-Instruct-2503

📖 Model Description

This repository contains GGUF quantized versions of the Mistral-Small-3.1-24B-Instruct-2503 model, optimized for efficient inference using llama.cpp, Ollama, and other GGUF-compatible frameworks.

Mistral Small 3.1 builds upon Mistral Small 3 (2501), adding state-of-the-art vision understanding and extending long-context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

Key Features ✨

  • 🖼️ Vision Capabilities: Analyze images and provide insights based on visual content
  • 🌍 Multilingual: Supports 24+ languages including English, French, German, Spanish, Japanese, Chinese, Arabic, and more
  • 🤖 Agent-Centric: Best-in-class agentic capabilities with native function calling and JSON output
  • 🧠 Advanced Reasoning: State-of-the-art conversational and reasoning capabilities
  • 📏 Long Context: 128k token context window for processing large documents
  • ⚖️ Apache 2.0 License: Open license for commercial and non-commercial use
  • 🎯 System Prompt Support: Strong adherence to system prompts

🚀 Quick Start

Using with Ollama

# Download and run the model
ollama run hf.co/your-username/mistral-small-3.1-24b-instruct-gguf:q4_k_m

# Or create from local file
ollama create mistral-small-local -f Modelfile
ollama run mistral-small-local

Modelfile for Ollama:

FROM ./mistral-small-3.1-24b-instruct-q4_k_m.gguf

TEMPLATE """<s>[SYSTEM_PROMPT]{{ .System }}[/SYSTEM_PROMPT][INST]{{ .Prompt }}[/INST]"""

PARAMETER temperature 0.15
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 128000

SYSTEM """You are Mistral Small 3.1, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris. You are knowledgeable, creative, and provide detailed responses while being concise when appropriate. You have vision capabilities and can analyze images when provided."""

Using with llama.cpp

# Download the model
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models

# Run inference
./llama-cli -m ./models/mistral-small-3.1-24b-instruct-q4_k_m.gguf -p "<s>[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Hello! How are you?[/INST]" -n 256 -c 128000
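llama.cpp also ships an OpenAI-compatible HTTP server. As a rough sketch, after starting it with something like ./llama-server -m ./models/mistral-small-3.1-24b-instruct-q4_k_m.gguf -c 32768 --port 8080 (adjust flags to your build and hardware), you can query it from Python using the standard chat-completions payload:

import requests

# Query a running llama-server instance via its OpenAI-compatible endpoint
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "mistral-small-3.1-24b-instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "Hello! How are you?"},
        ],
        "max_tokens": 256,
        "temperature": 0.15,
    },
)
print(resp.json()["choices"][0]["message"]["content"])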

Using with Python (llama-cpp-python)

from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./mistral-small-3.1-24b-instruct-q4_k_m.gguf",
    n_ctx=128000,  # Full 128k context window
    n_threads=8,   # Number of CPU threads
    n_gpu_layers=35,  # Number of layers to offload to GPU (if available)
    verbose=False
)

# Generate response with proper template
prompt = "<s>[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Explain quantum computing in simple terms[/INST]"

response = llm(
    prompt,
    max_tokens=512,
    temperature=0.15,
    top_p=0.9,
)

print(response["choices"][0]["text"])

📊 Quantization Variants

Variant | File Size | Description | Use Case | Quality Loss
F16 | 44.0 GB | Original precision | Maximum quality, research | None
Q8_0 | 23.3 GB | 8-bit quantization | High-end inference | Minimal
Q6_K | 18.0 GB | 6-bit K-quantization | Production quality | Very Low
Q5_K_M | 15.6 GB | 5-bit K-quant (medium) | Recommended balance | Low
Q5_K_S | 15.2 GB | 5-bit K-quant (small) | Balanced quality/size | Low
Q5_1 | 16.5 GB | 5-bit legacy | Legacy compatibility | Low
Q5_0 | 15.2 GB | 5-bit legacy | Legacy compatibility | Low
Q4_K_M | 13.4 GB | 4-bit K-quant (medium) | Popular choice | Moderate
Q4_K_S | 12.5 GB | 4-bit K-quant (small) | Resource constrained | Moderate
Q4_1 | 13.9 GB | 4-bit legacy | Legacy compatibility | Moderate
Q4_0 | 12.5 GB | 4-bit legacy | Legacy compatibility | Moderate
Q3_K_L | 11.5 GB | 3-bit K-quant (large) | Limited resources | Noticeable
Q3_K_M | 10.8 GB | 3-bit K-quant (medium) | Limited resources | Noticeable
Q3_K_S | 9.7 GB | 3-bit K-quant (small) | Very limited resources | Noticeable
Q2_K | 8.3 GB | 2-bit K-quantization | Extreme compression | Significant

🎯 Recommended Variants

  • Q5_K_M (15.6 GB): Best balance of quality and size for most users
  • Q4_K_M (13.4 GB): Good quality with smaller size, popular choice
  • Q6_K (18.0 GB): Near-original quality if you have the resources
  • Q3_K_M (10.8 GB): Minimum viable quality for resource-constrained environments

🛠️ Model Details

Architecture

  • Model Type: Mistral Small 3.1
  • Parameters: 24 billion
  • Context Length: 128,000 tokens (128k)
  • Vocabulary Size: 131,000 (Tekken tokenizer)
  • Architecture: Transformer with sliding window attention
  • Precision: Various GGUF quantizations
  • Base Model: Mistral-Small-3.1-24B-Base-2503

Capabilities

  • 🖼️ Vision Understanding: State-of-the-art multimodal capabilities for image analysis
  • 📝 Instruction Following: Excellent at following complex instructions
  • 💻 Code Generation: Strong programming capabilities across multiple languages
  • 🧮 Mathematical Reasoning: Advanced math and logical reasoning (69.30% on MATH benchmark)
  • 🌍 Multilingual: Native support for 24+ languages
  • 💬 Conversation: Natural dialogue and chat capabilities
  • 🔧 Function Calling: Native tool calling and JSON output capabilities
  • 📚 Long Context: Process documents up to 128k tokens

Benchmark Performance

Text Benchmarks

  • MMLU: 80.62% (general knowledge)
  • MATH: 69.30% (mathematical reasoning)
  • HumanEval: 88.41% (code generation)
  • GPQA: 44.42% (graduate-level questions)

Vision Benchmarks

  • MMMU: 64.00% (multimodal understanding)
  • ChartQA: 86.24% (chart analysis)
  • DocVQA: 94.08% (document visual Q&A)
  • AI2D: 93.72% (scientific diagrams)

Long Context

  • RULER 32K: 93.96%
  • RULER 128K: 81.20%
  • LongBench v2: 37.18%

💬 Chat Template

This model uses the Mistral V7-Tekken instruction format:

<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]

Examples:

Basic Chat:

<s>[SYSTEM_PROMPT]You are a helpful AI assistant.[/SYSTEM_PROMPT][INST]Write a Python function to calculate the factorial of a number[/INST]

With Vision:

<s>[SYSTEM_PROMPT]You are a helpful AI assistant with vision capabilities.[/SYSTEM_PROMPT][INST]What do you see in this image? <image>[/INST]
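For multi-turn use outside of a chat-template-aware runtime, the format above can be assembled programmatically. A small illustrative helper (plain string assembly following the template shown, not an official API):

def build_prompt(system_prompt, turns):
    """Assemble a Mistral V7-Tekken prompt from a system prompt and a list of
    (user, assistant) turns; the final assistant reply may be None when pending."""
    prompt = f"<s>[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT]"
    for user_msg, assistant_msg in turns:
        prompt += f"[INST]{user_msg}[/INST]"
        if assistant_msg is not None:
            prompt += f"{assistant_msg}</s>"
    return prompt

# Example: one completed turn, then a new user question awaiting a reply
print(build_prompt(
    "You are a helpful AI assistant.",
    [("What is GGUF?", "GGUF is a binary format for storing quantized LLM weights."),
     ("Which variant should I download?", None)],
))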

🔧 Technical Requirements

Minimum System Requirements

Variant | RAM | VRAM (GPU) | Storage
Q2_K | 16 GB | 8 GB | 10 GB
Q3_K_M | 24 GB | 12 GB | 12 GB
Q4_K_M | 32 GB | 16 GB | 15 GB
Q5_K_M | 48 GB | 18 GB | 17 GB
Q6_K+ | 64 GB | 20+ GB | 20+ GB

Recommended Hardware

  • CPU: Modern multi-core processor (12+ cores recommended for 128k context)
  • RAM: 64+ GB for optimal performance with long contexts
  • GPU: RTX 3090/4090 (24GB), RTX 6000 Ada (48GB), or A100 for GPU acceleration
  • Storage: NVMe SSD for faster model loading

Note: The original model requires ~55GB GPU RAM in bf16/fp16. Quantized versions significantly reduce memory requirements.
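As a back-of-the-envelope check before choosing a variant, total memory is roughly the GGUF file size plus the KV cache, which grows linearly with context length. A rough sketch of the standard estimate (the layer/head numbers below are placeholders to be read from the GGUF metadata, not values confirmed here):

def estimate_memory_gb(file_size_gb, n_layers, n_kv_heads, head_dim,
                       n_ctx, bytes_per_elem=2):
    """Rough total memory estimate: model weights + KV cache.
    KV cache = 2 (K and V) * layers * context * kv_heads * head_dim * bytes."""
    kv_cache_gb = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1024**3
    return file_size_gb + kv_cache_gb

# Example: Q4_K_M file (13.4 GB) at 32k context, with placeholder architecture values
print(estimate_memory_gb(13.4, n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=32768))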

📥 Download Instructions

Individual Files

# Download specific quantization
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf mistral-small-3.1-24b-instruct-q4_k_m.gguf --local-dir ./models

# Download all files (warning: ~240 GB total across all quantizations)
huggingface-cli download your-username/mistral-small-3.1-24b-instruct-gguf --local-dir ./models
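The same download can be done from Python with huggingface_hub; a minimal sketch using the placeholder repo and file names from the examples above:

from huggingface_hub import hf_hub_download

# Download a single quantization into ./models
path = hf_hub_download(
    repo_id="your-username/mistral-small-3.1-24b-instruct-gguf",
    filename="mistral-small-3.1-24b-instruct-q4_k_m.gguf",
    local_dir="./models",
)
print(path)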

Git LFS

git clone https://huggingface.co/your-username/mistral-small-3.1-24b-instruct-gguf
cd mistral-small-3.1-24b-instruct-gguf
git lfs pull

🧪 Usage Examples

Code Generation

<s>[SYSTEM_PROMPT]You are an expert programmer.[/SYSTEM_PROMPT][INST]Create a REST API using FastAPI for a todo application with CRUD operations[/INST]

Creative Writing

<s>[SYSTEM_PROMPT]You are a creative writing assistant.[/SYSTEM_PROMPT][INST]Write a short story about a time traveler who accidentally changes a small detail in the past[/INST]

Data Analysis Help

<s>[SYSTEM_PROMPT]You are a data science expert.[/SYSTEM_PROMPT][INST]I have a dataset with missing values. Explain different strategies to handle them and provide Python code examples[/INST]

Multilingual Support

<s>[SYSTEM_PROMPT]Tu es un assistant multilingue.[/SYSTEM_PROMPT][INST]Explique-moi la différence entre l'apprentissage supervisé et non supervisé[/INST]

Function Calling

# The model supports native function calling for tool use
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                }
            }
        }
    }
]
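With llama-cpp-python, this tools list can be passed to create_chat_completion (reusing the llm object from the Quick Start). Whether tool calls are actually emitted depends on the chat handler/template in use, so treat this as a sketch rather than a guaranteed path:

# Sketch: ask the model to use the get_weather tool defined above
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.15,
)
# The reply may include tool_calls if the model decided to invoke the tool
print(result["choices"][0]["message"])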

🏆 Ideal Use Cases

  • 💬 Fast-response conversational agents
  • ⚡ Low-latency function calling
  • 🎓 Subject matter experts via fine-tuning
  • 🏠 Local inference for privacy-sensitive applications
  • 💻 Programming and mathematical reasoning
  • 📄 Long document understanding and analysis
  • 🖼️ Visual content analysis and description
  • 🌍 Multilingual applications

⚠️ Limitations

  • Quantization Loss: Lower bit quantizations (Q2, Q3) may show reduced quality, especially for complex reasoning
  • Context Limit: Maximum context length of 128,000 tokens
  • Knowledge Cutoff: Training data cutoff as of October 2023
  • Hallucination: May generate plausible but incorrect information
  • Bias: May reflect biases present in training data
  • Vision: Text-only quantizations don't preserve vision capabilities optimally

🛡️ Ethical Considerations

  • Use responsibly and in accordance with Mistral AI's usage policies
  • Be aware of potential biases in model outputs
  • Verify important information from model responses
  • Consider privacy implications when processing sensitive data
  • Follow applicable laws and regulations in your jurisdiction
  • Respect copyright when analyzing images or documents

📄 License

This model is released under the Apache 2.0 License, same as the original Mistral-Small-3.1-24B-Instruct-2503 model.

🙏 Acknowledgments

  • Mistral AI for the original Mistral-Small-3.1-24B-Instruct-2503 model
  • Georgi Gerganov and the llama.cpp team for GGUF format and quantization tools
  • The open-source community for continued development of efficient inference tools

📞 Support


Made with ❤️ by the open-source community

🤗 Hugging Face · 🦙 llama.cpp · 🧠 Mistral AI · 📱 Ollama
