azzurra-voice 🇮🇹
azzurra-voice is a state-of-the-art, highly expressive text-to-speech (TTS) model for the Italian language, developed by Cartesia.
This model is the first release from the Azzurra Project, our initiative to build private, personal, and empathetic AI that feels Italian not just in language, but in culture, warmth, and presence. azzurra-voice was trained on tens of thousands of hours of high-quality, diverse Italian speech, capturing a wide range of accents, prosodies, and conversational styles from across Italy.
This model is released to empower researchers, developers, and makers to build more inclusive, local, and human-centered AI applications.
Features
- Highly Expressive and Natural: Generates speech with natural intonation and emotion, avoiding a robotic tone.
- Diverse Italian Dataset: Trained on a comprehensive dataset that includes various regional accents and conversational patterns, making the output feel authentic and familiar.
- Efficient and High-Quality: Optimized to run efficiently while producing high-quality speech at a 24,000 Hz sample rate.
- Open and Accessible: Free, open-weight, and easy to integrate using the transformers library.
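The transformers integration mentioned above relies on the CSM model classes, which only exist in recent releases of the library. The sketch below checks the installed version before any weights are downloaded. The minimum version `(4, 52, 1)` is an assumption based on when CSM support is understood to have landed in transformers, and `parse_version`, `meets_minimum`, and `installed_transformers_version` are hypothetical helper names, not part of the library.

```python
import re
from importlib import metadata

# Assumption: CSM support first shipped in transformers 4.52.1.
MIN_TRANSFORMERS = (4, 52, 1)


def parse_version(version):
    """Return the first three numeric components of a version string as a tuple."""
    nums = [int(n) for n in re.findall(r"\d+", version)[:3]]
    return tuple(nums + [0] * (3 - len(nums)))


def meets_minimum(version, minimum=MIN_TRANSFORMERS):
    """True if the given version string is at least the assumed minimum."""
    return parse_version(version) >= minimum


def installed_transformers_version():
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("transformers is not installed")
elif meets_minimum(found):
    print(f"transformers {found} should include the CSM classes")
else:
    print(f"transformers {found} is older than the assumed minimum; consider upgrading")
```

Pre-release tags such as `.dev0` are ignored by this simple parser, so it is a coarse check rather than a full version comparison.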
Usage
Generating speech is straightforward using the Hugging Face transformers library.
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
import soundfile as sf

# Use the GPU when available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and the model weights from the Hugging Face Hub.
processor = AutoProcessor.from_pretrained("cartesia/azzurra-voice")
model = CsmForConditionalGeneration.from_pretrained("cartesia/azzurra-voice").to(device)

# Wrap the text to synthesize in the chat-style structure the processor expects.
text = "La sintesi vocale è un processo complesso"
conversation = [
    {"role": "user", "content": [{"type": "text", "text": text}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# Generate the audio and save it as a 24,000 Hz WAV file.
audio_output = model.generate(**inputs, output_audio=True)
waveform = audio_output[0].cpu().numpy()
sf.write("output.wav", waveform, 24_000)
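To synthesize several sentences, the same chat-style structure can be built once per input. The following is a minimal, model-free sketch: `build_conversation` and `split_sentences` are hypothetical helper names, not part of transformers or of this model, and the naive punctuation-based splitting is an assumption for illustration rather than something the model requires.

```python
import re


def build_conversation(text, role="user"):
    """Wrap one utterance in the chat-style structure used in the usage example."""
    return [{"role": role, "content": [{"type": "text", "text": text}]}]


def split_sentences(text):
    """Naively split on '.', '!' or '?' followed by whitespace (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


paragraph = "Buongiorno a tutti. Oggi parliamo di sintesi vocale! Siete pronti?"
conversations = [build_conversation(s) for s in split_sentences(paragraph)]
print(len(conversations))
```

Each element of `conversations` can then be passed to `processor.apply_chat_template` in place of `conversation` in the usage example, generating one audio clip per sentence.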
Model Details
- Model Architecture: azzurra-voice is a sesame/csm-1b model.
- Language: Italian
- Sample Rate: 24,000 Hz
- Training Data: The model was trained on a proprietary dataset composed of tens of thousands of hours of high-quality Italian speech. The dataset covers a wide demographic and geographic range within Italy.
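Given the 24,000 Hz sample rate listed above, the duration of a generated clip follows directly from its sample count (duration = samples / 24,000). The sketch below illustrates this and writes float samples as 16-bit PCM using only the Python standard library, as one possible alternative to soundfile. `duration_seconds` and `write_pcm16_wav` are hypothetical helpers, and the mono layout and [-1, 1] float range are assumptions about the waveform, not documented properties of the model output.

```python
import array
import math
import os
import sys
import tempfile
import wave

SAMPLE_RATE = 24_000  # Hz, as listed in Model Details


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Clip length in seconds for a given number of samples."""
    return num_samples / sample_rate


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed to lie in [-1, 1]) as a 16-bit PCM WAV file."""
    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono (assumption)
        wf.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())


# Stand-in for model output: one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
path = os.path.join(tempfile.gettempdir(), "azzurra_tone.wav")
write_pcm16_wav(path, tone)
print(duration_seconds(len(tone)))
```

With a real NumPy waveform, `waveform.tolist()` could be passed as `samples`, and `duration_seconds(len(waveform))` would give the clip length.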
License
This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
This means:
- You are free to:
  - Share: copy and redistribute the material in any medium or format.
  - Adapt: remix, transform, and build upon the material.
- Under the following terms:
  - Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
  - NonCommercial: You may not use the material for commercial purposes.
  - ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
For the full license text, please visit: https://creativecommons.org/licenses/by-nc-sa/4.0/
Citation
If you use azzurra-voice in your research, please cite it as follows:
@software{Cartesia_Azzurra_Voice_2025,
  author = {Cartesia},
  title = {{azzurra-voice: A State-of-the-Art Open Italian Text-to-Speech Model}},
  month = {8},
  year = {2025},
  publisher = {Cartesia},
  url = {https://huggingface.co/cartesia/azzurra-voice}
}