Veena - Text to Speech for Indian Languages

Veena is a neural text-to-speech (TTS) model for Indian languages, developed by Maya Research. Built on a Llama architecture backbone, it generates natural, expressive speech in Hindi and English, with reported latency under 80 ms on H100-80GB GPUs.

Model Overview

Veena is a 3B parameter autoregressive transformer model based on the Llama architecture. It is designed to synthesize high-quality speech from text in Hindi and English, including code-mixed scenarios. The model outputs audio at a 24kHz sampling rate using the SNAC neural codec.

Model type: Autoregressive Transformer
Base Architecture: Llama (3B parameters)
Languages: Hindi, English
Audio Codec: SNAC @ 24kHz
License: Apache 2.0
Developed by: Maya Research
Model URL: https://huggingface.co/maya-research/veena

Key Features

4 Distinct Voices: kavya, agastya, maitri, and vinaya - each with unique vocal characteristics.
Multilingual Support: Native Hindi and English capabilities with code-mixed support.
Ultra-Fast Inference: Sub-80ms latency on H100-80GB GPUs.
High-Quality Audio: 24kHz output with the SNAC neural codec.
Production-Ready: Optimized for real-world deployment with 4-bit quantization support.

How to Get Started with the Model

Installation

To use Veena, you need to install the transformers, torch, torchaudio, snac, and bitsandbytes libraries.

pip install transformers torch torchaudio
pip install snac bitsandbytes  # For audio decoding and quantization

Basic Usage

The following Python code demonstrates how to generate speech from text using Veena with 4-bit quantization for efficient inference.
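As a minimal sketch of the loading step only, the model could be loaded in 4-bit roughly as follows. It assumes the repository ID from the Model URL above and NF4 with double quantization as the 4-bit settings, mirroring the memory optimization listed under Training Infrastructure. The helper names quantization_settings and load_veena are hypothetical, device_map="auto" additionally requires the accelerate package, and token generation and SNAC decoding are not shown.

```python
# Assumption: repository ID taken from the Model URL in the overview.
MODEL_ID = "maya-research/veena"


def quantization_settings():
    """Return 4-bit settings as plain keyword arguments (hypothetical helper).

    Assumption: NF4 with double quantization, as listed for training.
    """
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
    }


def load_veena(model_id=MODEL_ID):
    """Load the model and tokenizer with 4-bit quantization (hypothetical helper)."""
    # Heavy imports stay local so the settings can be inspected without a GPU.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,
        **quantization_settings(),
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # needs the accelerate package
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Calling load_veena() downloads the weights and expects a CUDA-capable GPU with bitsandbytes installed.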

Uses

Veena is ideal for a wide range of applications requiring high-quality, low-latency speech synthesis for Indian languages, including:

Accessibility: Screen readers and voice-enabled assistance for visually impaired users.
Customer Service: IVR systems, voice bots, and automated announcements.
Content Creation: Dubbing for videos, e-learning materials, and audiobooks.
Automotive: In-car navigation and infotainment systems.
Edge Devices: Voice-enabled smart devices and IoT applications.

Technical Specifications

Architecture

Veena uses a 3B-parameter transformer architecture with the following components:

Base Architecture: Llama-style autoregressive transformer (3B parameters)
Audio Codec: SNAC (24kHz) for high-quality audio token generation
Speaker Conditioning: Special speaker tokens (<spk_kavya>, <spk_agastya>, <spk_maitri>, <spk_vinaya>)
Parameter-Efficient Training: LoRA adaptation with differentiated ranks for attention and FFN modules.
Context Length: 2048 tokens
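Speaker conditioning can be illustrated with a small prompt builder. This is a sketch under assumptions: the four speaker names come from the list above, but the helper name build_prompt and the placement of the speaker token before the text, separated by a single space, are hypothetical and may differ from the format the model expects.

```python
# Speaker names as listed in this model card.
SPEAKERS = ("kavya", "agastya", "maitri", "vinaya")


def build_prompt(text: str, speaker: str = "kavya") -> str:
    """Prefix the text with a speaker token (hypothetical helper).

    Assumption: the token goes first, followed by one space and the text.
    """
    if speaker not in SPEAKERS:
        raise ValueError(
            f"Unknown speaker {speaker!r}; expected one of {SPEAKERS}"
        )
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    return f"<spk_{speaker}> {text}"
```

Validating the speaker name up front avoids sending the model a token it was never trained on.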

Training

Training Infrastructure

Hardware: 8× NVIDIA H100 80GB GPUs
Distributed Training: Distributed Data Parallel (DDP) with optimized communication
Precision: BF16 mixed precision training with gradient checkpointing
Memory Optimization: 4-bit quantization with NF4 + double quantization

Training Configuration

LoRA Configuration:
- lora_rank_attention: 192
- lora_rank_ffn: 96
- lora_alpha: 2× rank (384 for attention, 192 for FFN)
- lora_dropout: 0.05
- target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- modules_to_save: ["embed_tokens"]
Optimizer Configuration:
- optimizer: AdamW (8-bit)
- optimizer_betas: (0.9, 0.98)
- optimizer_eps: 1e-5
- learning_rate_peak: 1e-4
- lr_scheduler: cosine
- warmup_ratio: 0.02
Batch Configuration:
- micro_batch_size: 8
- gradient_accumulation_steps: 4
- effective_batch_size: 256
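The derived values in this configuration can be checked with a short calculation. The sketch below assumes that micro_batch_size is counted per GPU and that all 8 GPUs from the Training Infrastructure section take part in data-parallel training; the function names are hypothetical.

```python
# Values copied from the configuration above.
NUM_GPUS = 8
MICRO_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4
LORA_RANKS = {"attention": 192, "ffn": 96}
ALPHA_MULTIPLIER = 2


def effective_batch_size(micro_batch, accum_steps, num_gpus):
    """Samples contributing to one optimizer step.

    Assumption: micro_batch is the per-GPU batch size under data parallelism.
    """
    return micro_batch * accum_steps * num_gpus


def lora_alphas(ranks, multiplier):
    """Compute lora_alpha for each module group as multiplier x rank."""
    return {group: multiplier * rank for group, rank in ranks.items()}


if __name__ == "__main__":
    print(effective_batch_size(MICRO_BATCH_SIZE, GRAD_ACCUM_STEPS, NUM_GPUS))
    print(lora_alphas(LORA_RANKS, ALPHA_MULTIPLIER))
```

Under these assumptions the product 8 × 4 × 8 matches the listed effective batch size of 256, and the alphas match the listed 384 and 192. Because LoRA scales its update by alpha divided by rank, both module groups end up with the same scaling factor of 2 despite their different ranks.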

Training Data

Veena was trained on proprietary, high-quality datasets specifically curated for Indian language TTS.

Data Volume: 15,000+ utterances per speaker (60,000+ total)
Languages: Native Hindi and English utterances with code-mixed support
Speaker Diversity: 4 professional voice artists with distinct characteristics
Audio Quality: Studio-grade recordings at 24kHz sampling rate
Content Diversity: Conversational, narrative, expressive, and informational styles

Note: The training datasets are proprietary and not publicly available.

Performance Benchmarks

Metric	Value
Latency (H100-80GB)	<80ms
Latency (A100-40GB)	~120ms
Latency (RTX 4090)	~200ms
Real-time Factor	0.05x
Throughput	~170k tokens/s (8×H100)
Audio Quality (MOS)	4.2/5.0
Speaker Similarity	92%
Intelligibility	98%
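To make the real-time factor concrete, the sketch below assumes the common convention that real-time factor equals synthesis time divided by the duration of the generated audio; the card does not state which definition it uses, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000   # output sampling rate from this card
REAL_TIME_FACTOR = 0.05   # value from the table above


def synthesis_time_seconds(audio_seconds, rtf=REAL_TIME_FACTOR):
    """Estimated compute time for a clip.

    Assumption: rtf = synthesis time / audio duration.
    """
    return audio_seconds * rtf


def sample_count(audio_seconds, sample_rate=SAMPLE_RATE_HZ):
    """Number of audio samples in a clip of the given duration."""
    return int(round(audio_seconds * sample_rate))


if __name__ == "__main__":
    clip = 10.0  # seconds of speech
    print(synthesis_time_seconds(clip), sample_count(clip))
```

Under this convention, a 10-second clip would take about 0.5 seconds to synthesize and contain 240,000 samples at 24 kHz.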

Risks, Limitations and Biases

Language Support: Currently supports only Hindi and English. Performance on other Indian languages is not guaranteed.
Speaker Diversity: Limited to 4 speaker voices, which may not represent the full diversity of Indian accents and dialects.
Hardware Requirements: Requires a GPU for real-time or near-real-time inference. CPU performance will be significantly slower.
Input Length: The model is limited to a maximum input length of 2048 tokens.
Bias: The model's performance and voice characteristics reflect the proprietary training data, and the model may reproduce biases present in that data.

Future Updates

We are actively working on expanding Veena's capabilities:

Support for Tamil, Telugu, Bengali, Marathi, and other Indian languages.
Additional speaker voices with regional accents.
Emotion and prosody control tokens.
Streaming inference support.
CPU optimization for edge deployment.

Citing

If you use Veena in your research or applications, please cite:

@misc{veena2025,
  title={Veena: Open Source Text-to-Speech for Indian Languages},
  author={Maya Research Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/maya-research/veena-tts}
}

Acknowledgments

We thank the open-source community and all contributors who made this project possible. Special thanks to the voice artists who provided high-quality recordings for training.
