	---
	language:
	- en
	- fr
	- de
	- es
	- it
	- pt
	- nl
	- hi
	license: apache-2.0
	library_name: vllm
	inference: false
	extra_gated_description: >-
	  If you want to learn more about how we process your personal data, please read
	  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
	pipeline_tag: audio-text-to-text
	---
	# Voxtral Mini 1.0 (3B) - 2507

	Voxtral Mini is an enhancement of [Ministral 3B](https://mistral.ai/news/ministraux), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

	Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral).

	## Key Features

	Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
	- Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
	- Long-form context: With a 32k token context length, Voxtral handles audio of up to 30 minutes for transcription, or 40 minutes for understanding
	- Built-in Q&A and summarization: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
	- Natively multilingual: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
	- Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
	- Highly capable at text: Retains the text understanding capabilities of its language model backbone, Ministral-3B
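
As a rough illustration of the long-form limits listed above, the sketch below checks whether a clip's duration fits within the stated 30-minute (transcription) or 40-minute (understanding) budget. The helper name `fits_context` and the mode strings are hypothetical, and the limits are simply the figures quoted in this README, not values read from the model.

```python
# Duration limits quoted in the feature list above, in minutes.
MAX_MINUTES = {"transcription": 30, "understanding": 40}


def fits_context(duration_seconds: float, mode: str = "transcription") -> bool:
    """Return True if a clip of the given length is within the stated limit."""
    if mode not in MAX_MINUTES:
        raise ValueError(f"unknown mode: {mode!r}")
    return duration_seconds <= MAX_MINUTES[mode] * 60


print(fits_context(25 * 60))                   # True: 25 min fits transcription
print(fits_context(35 * 60))                   # False: 35 min exceeds 30 min
print(fits_context(35 * 60, "understanding"))  # True: 35 min fits understanding
```

Longer recordings would need to be split before being sent to the model; how best to split them is outside the scope of this sketch.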

	## Benchmark Results

	### Audio

	Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)
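
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration using whitespace tokenization. The function name is hypothetical, and the benchmark figures above may rely on different text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate(
    "the weather in barcelona is sunny",
    "the weather in barcelona was sunny",
))
```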

	### Text

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/iH9V8JVtMoaGlqJd6FIri.png)

	## Usage

	The model can be used with the following frameworks:
	- [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)

	Notes:

	- Use `temperature=0.2` and `top_p=0.95` for chat completion (e.g. audio understanding) and `temperature=0.0` for transcription
	- Multiple audio files per message and multiple user turns with audio are supported
	- System prompts are not yet supported
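
One way to keep these settings consistent across calls is a small lookup table. This is only a sketch: the names `SAMPLING_DEFAULTS` and `sampling_params` are hypothetical, and the values are the ones recommended in the notes above.

```python
# Recommended sampling settings taken from the notes above.
SAMPLING_DEFAULTS = {
    "chat": {"temperature": 0.2, "top_p": 0.95},
    "transcription": {"temperature": 0.0},
}


def sampling_params(task: str) -> dict:
    """Return a copy of the recommended settings for the given task."""
    try:
        return dict(SAMPLING_DEFAULTS[task])
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None


print(sampling_params("chat"))
print(sampling_params("transcription"))
```

The returned dictionary can be unpacked into a request, for example `client.chat.completions.create(model=model, messages=messages, **sampling_params("chat"))`.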

	### vLLM (recommended)

	We recommend using this model with [vLLM](https://github.com/vllm-project/vllm).

	#### Installation

	Make sure to install vLLM from "main". We recommend using `uv`:

	```sh
	uv pip install -U "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
	```

	Doing so should automatically install [`mistral_common >= 1.8.1`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.1).

	To check:
	```sh
	python -c "import mistral_common; print(mistral_common.__version__)"
	```
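
If you would rather have a script confirm the minimum version than read it off by eye, something like the following could work. It is a simplified sketch: `parse_version` and `meets_minimum` are hypothetical helpers that compare only the leading numeric components, so a pre-release such as `1.8.1rc1` is treated the same as `1.8.1`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '1.8.1' into (1, 8, 1); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.8.1") -> bool:
    """Compare the numeric components of two version strings."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = version("mistral_common")
    print(installed, "OK" if meets_minimum(installed) else "too old")
except PackageNotFoundError:
    print("mistral_common is not installed")
```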

	#### Offline

	You can test that your vLLM setup works as expected by cloning the vLLM repo:

	```sh
	git clone https://github.com/vllm-project/vllm && cd vllm
	```

	and then running:

	```sh
	python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
	```

	#### Serve

	We recommend that you use Voxtral-Mini-3B-2507 in a server/client setting.

	1. Spin up a server:

	```sh
	vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral
	```

	Note: Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.


	2. To query the server, you can use a simple Python snippet. See the following examples.
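
Before running the longer examples, it can help to confirm that the server is reachable. The sketch below uses only the standard library and queries the OpenAI-compatible models endpoint, the same one the examples reach through `client.models.list()`. The helper names are hypothetical, and the response layout (a `data` list of entries with an `id`) is assumed to follow the OpenAI format.

```python
import json
from urllib.request import urlopen


def model_ids(payload: dict) -> list:
    """Extract model names from an OpenAI-style model list response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str, timeout: float = 5.0) -> list:
    """Query <base_url>/models and return the ids of the served models."""
    url = f"{base_url.rstrip('/')}/models"
    with urlopen(url, timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling `list_served_models("http://<your-server-host>:8000/v1")` should return a list containing the served model name once the server has finished loading; a connection error usually means it is still starting or the address is wrong.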


	#### Audio Instruct

	Leverage the audio capabilities of Voxtral-Mini-3B-2507 to chat.

	Make sure that your client has `mistral-common` with audio installed:

	```sh
	pip install --upgrade mistral_common\[audio\]
	```

	<details>
	<summary>Python snippet</summary>

	```python
	from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
	from mistral_common.audio import Audio
	from huggingface_hub import hf_hub_download

	from openai import OpenAI

	# Modify OpenAI's API key and API base to use vLLM's API server.
	openai_api_key = "EMPTY"
	openai_api_base = "http://<your-server-host>:8000/v1"

	client = OpenAI(
	    api_key=openai_api_key,
	    base_url=openai_api_base,
	)

	models = client.models.list()
	model = models.data[0].id

	obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
	bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

	def file_to_chunk(file: str) -> AudioChunk:
	    # Load the audio file and wrap it in a chunk that can be sent in a message.
	    audio = Audio.from_file(file, strict=False)
	    return AudioChunk.from_audio(audio)

	text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
	user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

	print(30 * "=" + "USER 1" + 30 * "=")
	print(text_chunk.text)
	print("\n\n")

	response = client.chat.completions.create(
	    model=model,
	    messages=[user_msg],
	    temperature=0.2,
	    top_p=0.95,
	)
	content = response.choices[0].message.content

	print(30 * "=" + "BOT 1" + 30 * "=")
	print(content)
	print("\n\n")
	# The speaker who is more inspiring is the one who delivered the farewell address, as they express
	# gratitude, optimism, and a strong commitment to the nation and its citizens. They emphasize the importance of
	# self-government and active citizenship, encouraging everyone to participate in the democratic process. In contrast,
	# the other speaker provides a factual update on the weather in Barcelona, which is less inspiring as it
	# lacks the emotional and motivational content of the farewell address.

	# Differences:
	# - The farewell address speaker focuses on the values and responsibilities of citizenship, encouraging active participation in democracy.
	# - The weather update speaker provides factual information about the temperature in Barcelona, without any emotional or motivational content.


	messages = [
	    user_msg,
	    AssistantMessage(content=content).to_openai(),
	    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai(),
	]
	print(30 * "=" + "USER 2" + 30 * "=")
	print(messages[-1]["content"])
	print("\n\n")

	response = client.chat.completions.create(
	    model=model,
	    messages=messages,
	    temperature=0.2,
	    top_p=0.95,
	)
	content = response.choices[0].message.content
	print(30 * "=" + "BOT 2" + 30 * "=")
	print(content)
	```
	</details>
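
The feature list mentions function calling from voice. The sketch below shows how a tool definition in the OpenAI chat-completions format might be attached to a request like the one above. The `get_current_weather` tool, its parameters, and the helper names are hypothetical, and the sketch assumes the server was started with tool calling enabled, which this README does not cover.

```python
import json

# Hypothetical tool definition in the OpenAI chat-completions "tools" format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Barcelona"},
            },
            "required": ["city"],
        },
    },
}


def build_tool_request(model: str, messages: list) -> dict:
    """Assemble keyword arguments for client.chat.completions.create."""
    return {
        "model": model,
        "messages": messages,
        "tools": [WEATHER_TOOL],
        "temperature": 0.2,
        "top_p": 0.95,
    }


def parse_tool_arguments(arguments_json: str) -> dict:
    """Tool-call arguments arrive as a JSON string; decode them to a dict."""
    return json.loads(arguments_json)
```

With the client from the snippet above, `client.chat.completions.create(**build_tool_request(model, [user_msg]))` would send the request. Any calls the model decides to make would then appear under `response.choices[0].message.tool_calls`, each carrying a function name and a JSON string of arguments.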

	#### Transcription

	Voxtral-Mini-3B-2507 has powerful transcription capabilities!

	Make sure that your client has `mistral-common` with audio installed:

	```sh
	pip install --upgrade mistral_common\[audio\]
	```

	<details>
	<summary>Python snippet</summary>

	```python
	from mistral_common.protocol.transcription.request import TranscriptionRequest
	from mistral_common.protocol.instruct.messages import RawAudio
	from mistral_common.audio import Audio
	from huggingface_hub import hf_hub_download

	from openai import OpenAI

	# Modify OpenAI's API key and API base to use vLLM's API server.
	openai_api_key = "EMPTY"
	openai_api_base = "http://<your-server-host>:8000/v1"

	client = OpenAI(
	    api_key=openai_api_key,
	    base_url=openai_api_base,
	)

	models = client.models.list()
	model = models.data[0].id

	obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
	audio = Audio.from_file(obama_file, strict=False)

	audio = RawAudio.from_audio(audio)
	req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))

	response = client.audio.transcriptions.create(**req)
	print(response)
	```
	</details>