---
library_name: vllm
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
inference: false
base_model:
- mistralai/Mistral-Small-3.1-24B-Base-2503
- togethercomputer/mistral-3.2-instruct-2506
extra_gated_description: >-
  If you want to learn more about how we process your personal data, please read
  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
tags:
- mistral-common
- quantized
model_type: mistral
quantization: bitsandbytes
---

# Mistral-Small-3.2-24B-Instruct-2506 (Quantized)

This is a quantized version of [togethercomputer/mistral-3.2-instruct-2506](https://huggingface.co/togethercomputer/mistral-3.2-instruct-2506), optimized for reduced memory usage while maintaining performance.

Mistral-Small-3.2-24B-Instruct-2506 is a minor update of [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503).

## Quantization Details

This model has been quantized with bitsandbytes to reduce memory requirements while preserving model quality. Quantization significantly reduces the model size compared to the original fp16/bf16 weights.
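
As a rough illustration, a bitsandbytes-quantized checkpoint like this one can be loaded with `transformers`. This is a minimal sketch, not the card's official recipe: the repository ID is a placeholder and the 4-bit NF4 settings are assumptions, so adjust them to match the actual checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder repository ID -- replace with the ID of this quantized model.
model_id = "your-org/Mistral-Small-3.2-24B-Instruct-2506-bnb"

# Assumed 4-bit NF4 configuration; use load_in_8bit=True instead for 8-bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```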

## Base Model Improvements

Small-3.2 improves in the following categories:

- **Instruction following**: Small-3.2 is better at following precise instructions
- **Repetition errors**: Small-3.2 produces fewer infinite generations and repetitive answers
- **Function calling**: Small-3.2's function calling template is more robust

In all other categories, Small-3.2 should match or slightly improve on [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503).

## Key Features

- Same feature set as [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503#key-features)
- Reduced memory footprint through quantization
- Optimized for inference while maintaining output quality

## Usage

The quantized model can be used with the following frameworks:

- [`vllm`](https://github.com/vllm-project/vllm) (recommended)
- [`transformers`](https://github.com/huggingface/transformers)

**Note 1**: We recommend using a relatively low temperature, such as `temperature=0.15`.

**Note 2**: Make sure to add a system prompt to the model to best tailor it to your needs.
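
For example, when the model is served through vLLM's OpenAI-compatible server, a request that follows both notes might look like the sketch below. The server URL, API key, model ID, and prompts are placeholders, not values from this card.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible vLLM server is already running, e.g. via
# `vllm serve <this-quantized-model-id>`. URL and API key are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="<this-quantized-model-id>",  # placeholder model ID
    messages=[
        # A system prompt tailors the model to your needs (see Note 2).
        {"role": "system", "content": "You are a concise, helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of quantization in two sentences."},
    ],
    temperature=0.15,  # relatively low temperature, as recommended in Note 1
)
print(response.choices[0].message.content)
```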

### Memory Requirements

This quantized version requires significantly less GPU memory than the original model:

- Original: ~55 GB of GPU RAM in bf16 or fp16
- Quantized: substantially smaller footprint (the exact figure depends on the bitsandbytes configuration, e.g. 8-bit vs. 4-bit)
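
To measure the actual footprint on your hardware, `transformers` exposes a helper on loaded models. A small sketch, assuming `model` was loaded as in the earlier Quantization Details example:

```python
# Assumes `model` is the quantized model loaded earlier with transformers.
gib = model.get_memory_footprint() / (1024 ** 3)  # bytes -> GiB
print(f"Approximate model memory footprint: {gib:.1f} GiB")
```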

## License

This model inherits the same license as the base model: Apache-2.0.

## Original Model

For benchmark results and detailed usage examples, please refer to the original model: [togethercomputer/mistral-3.2-instruct-2506](https://huggingface.co/togethercomputer/mistral-3.2-instruct-2506)