---
library_name: vllm
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
inference: false
base_model:
- mistralai/Mistral-Small-3.1-24B-Base-2503
- togethercomputer/mistral-3.2-instruct-2506
extra_gated_description: >-
If you want to learn more about how we process your personal data, please read
our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
tags:
- mistral-common
- quantized
model_type: mistral
quantization: bitsandbytes
---
# Mistral-Small-3.2-24B-Instruct-2506 (Quantized)
This is a quantized version of [togethercomputer/mistral-3.2-instruct-2506](https://huggingface.co/togethercomputer/mistral-3.2-instruct-2506), optimized for reduced memory usage while maintaining performance.
Mistral-Small-3.2-24B-Instruct-2506 is a minor update of [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503).
## Quantization Details
This model has been quantized with bitsandbytes to reduce memory requirements while preserving model quality. Quantization significantly reduces the model size compared to the original fp16/bf16 weights.
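As a rough illustration of what that means in practice, the metadata above lists `bitsandbytes` as the quantization method, so loading in 4-bit with `transformers` might look like the sketch below. The repository id and the 4-bit settings are assumptions, and the sketch assumes a text-only export; if this checkpoint keeps the base model's vision tower, the corresponding multimodal model class would be needed instead.

```python
# Minimal sketch: loading a bitsandbytes-quantized checkpoint with transformers.
# The repo id is a placeholder; the 4-bit settings are assumptions about how this
# checkpoint was exported.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "path/to/this-quantized-repo"  # placeholder: replace with this repository's id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # assumed 4-bit; use load_in_8bit=True for 8-bit instead
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```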
## Base Model Improvements
Small-3.2 improves in the following categories:
- **Instruction following**: Small-3.2 is better at following precise instructions
- **Repetition errors**: Small-3.2 produces fewer infinite generations and repetitive answers
- **Function calling**: Small-3.2's function calling template is more robust
In all other categories Small-3.2 should match or slightly improve compared to [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503).
## Key Features
- Same as [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503#key-features)
- Reduced memory footprint through quantization
- Optimized for inference with maintained quality
## Usage
The quantized model can be used with the following frameworks:
- [`vllm`](https://github.com/vllm-project/vllm) (recommended)
- [`transformers`](https://github.com/huggingface/transformers)
**Note 1**: We recommend using a relatively low temperature, such as `temperature=0.15`.
**Note 2**: Make sure to add a system prompt to the model to best tailor it to your needs.
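For example, a minimal offline-inference sketch with vLLM that applies the low temperature and a system prompt as recommended above might look like this. The repository id is a placeholder and `quantization="bitsandbytes"` assumes vLLM's bitsandbytes support; check the vLLM documentation for the options your version accepts.

```python
# Minimal sketch: offline inference with vLLM using the recommended sampling settings.
# The repo id is a placeholder; quantization="bitsandbytes" is an assumption based on
# the quantization method listed in this card's metadata.
from vllm import LLM, SamplingParams

model_id = "path/to/this-quantized-repo"  # placeholder: replace with this repository's id

llm = LLM(model=model_id, quantization="bitsandbytes")

sampling_params = SamplingParams(temperature=0.15, max_tokens=512)

messages = [
    {"role": "system", "content": "You are a concise, helpful assistant."},  # tailor to your needs
    {"role": "user", "content": "Summarize the benefits of quantization in two sentences."},
]

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```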
### Memory Requirements
This quantized version requires significantly less GPU memory than the original model:
- Original: ~55 GB of GPU RAM in bf16 or fp16
- Quantized: substantially lower (the exact footprint depends on the bitsandbytes configuration used, e.g. 4-bit vs. 8-bit)
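As a back-of-the-envelope estimate, 24B parameters occupy roughly 24B × 2 bytes ≈ 48 GB in bf16 (plus activation and KV-cache overhead, hence the ~55 GB figure above), while 4-bit weights take roughly 24B × 0.5 bytes ≈ 12 GB and 8-bit weights roughly 24 GB before overhead. Treat these as rough estimates and measure on your own hardware.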
## License
This model inherits the same license as the base model: Apache-2.0
## Original Model
For benchmark results and detailed usage examples, please refer to the original model: [togethercomputer/mistral-3.2-instruct-2506](https://huggingface.co/togethercomputer/mistral-3.2-instruct-2506)