azzurra-voice 🇮🇹

azzurra-voice is a state-of-the-art, highly expressive text-to-speech (TTS) model for the Italian language, developed by Cartesia.

This model is the first release from the Azzurra Project, our initiative to build private, personal, and empathetic AI that feels Italian not just in language, but in culture, warmth, and presence. azzurra-voice was trained on tens of thousands of hours of high-quality, diverse Italian speech, capturing a wide range of accents, prosodies, and conversational styles from across Italy.

This model is released to empower researchers, developers, and makers to build more inclusive, local, and human-centered AI applications.

Features

  • Highly Expressive and Natural: Generates speech with natural intonation and emotion, avoiding a robotic tone.
  • Diverse Italian Dataset: Trained on a comprehensive dataset that includes various regional accents and conversational patterns, making the output feel authentic and familiar.
  • Efficient and High-Quality: Optimized to run efficiently while producing high-quality speech at a 24,000 Hz sample rate.
  • Open and Accessible: Free, open-weight, and easy to integrate using the transformers library.

Usage

Generating speech is straightforward using the Hugging Face transformers library.

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
import soundfile as sf

# Use a GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and the model weights from the Hugging Face Hub.
processor = AutoProcessor.from_pretrained("cartesia/azzurra-voice")
model = CsmForConditionalGeneration.from_pretrained("cartesia/azzurra-voice").to(device)

text = "La sintesi vocale è un processo complesso"
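# Wrap the text in the chat format expected by the processor.
conversation = [
    {"role": "user", "content": [{"type": "text", "text": text}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# Generate speech; output_audio=True returns decoded audio waveforms.
audio_output = model.generate(**inputs, output_audio=True)
# Take the first waveform and move it to the CPU as a NumPy array.
waveform = audio_output[0].cpu().numpy()

# Save the waveform as a WAV file at the model's 24,000 Hz sample rate.
sf.write("output.wav", waveform, 24_000)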
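
For longer passages, one option is to prepare one conversation per sentence and run the generation step above on each in turn. The sketch below only covers the input preparation. The helper names split_sentences and build_conversation are hypothetical, the splitting rule is deliberately naive, and whether sentence-level synthesis gives the best results with this model is an assumption, not something this card states.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def build_conversation(text, role="user"):
    """Wrap one string in the same chat structure used in the usage example."""
    return [{"role": role, "content": [{"type": "text", "text": text}]}]


paragraph = "Buongiorno a tutti. Come state oggi? Spero bene!"
conversations = [build_conversation(s) for s in split_sentences(paragraph)]

# Each entry can be passed to processor.apply_chat_template as shown above.
for conversation in conversations:
    print(conversation[0]["content"][0]["text"])
```

A rule this simple will split on abbreviations such as "Sig." or "dott.", so a proper sentence tokenizer may be preferable for real text.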

Model Details

  • Model Architecture: azzurra-voice is a fine-tuned version of sesame/csm-1b
  • Language: Italian
  • Sample Rate: 24,000 Hz
  • Training Data: The model was trained on a proprietary dataset composed of tens of thousands of hours of high-quality Italian speech. The dataset covers a wide demographic and geographic range within Italy.
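
Because the output is audio sampled at 24,000 Hz, clip length and file writing follow directly from the sample count. The sketch below is one possible way to write such audio with Python's standard wave module instead of soundfile. It assumes a mono waveform given as a sequence of floats in the range [-1.0, 1.0]; the helper names float_to_pcm16, write_wav, and duration_seconds are hypothetical, and the silent placeholder stands in for real model output.

```python
import io
import wave
from array import array

SAMPLE_RATE = 24_000  # sample rate listed under Model Details


def float_to_pcm16(samples):
    """Clip float samples to [-1.0, 1.0] and scale them to 16-bit integers."""
    pcm = array("h")
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    return pcm


def write_wav(target, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit PCM audio to a file path or a binary file object."""
    pcm = float_to_pcm16(samples)
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Length of a clip in seconds: samples divided by samples per second."""
    return num_samples / sample_rate


# Stand-in for model output: half a second of silence (12,000 samples).
placeholder = [0.0] * 12_000
buffer = io.BytesIO()
write_wav(buffer, placeholder)
print(duration_seconds(len(placeholder)))  # 0.5
```

With a NumPy waveform such as the one produced in the usage example, a call like write_wav("output.wav", waveform.tolist()) would be the equivalent, assuming the array is one-dimensional.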

License

This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

This means:

  • You are free to:
    • Share: copy and redistribute the material in any medium or format.
    • Adapt: remix, transform, and build upon the material.
  • Under the following terms:
    • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
    • NonCommercial: You may not use the material for commercial purposes.
    • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

For the full license text, please visit: https://creativecommons.org/licenses/by-nc-sa/4.0/

Citation

If you use azzurra-voice in your research, please cite it as follows:

@software{Cartesia_Azzurra_Voice_2025,
  author = {Cartesia},
  title = {{azzurra-voice: A State-of-the-Art Open Italian Text-to-Speech Model}},
  month = {8},
  year = {2025},
  publisher = {Cartesia},
  url = {https://huggingface.co/cartesia/azzurra-voice}
}