Mungert committed on
Commit
f87ad92
·
verified ·
1 Parent(s): 7d6fec2

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +414 -0
README.md ADDED
@@ -0,0 +1,414 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - hi
6
+ library_name: transformers
7
+ tags:
8
+ - text-to-speech
9
+ - tts
10
+ - hindi
11
+ - english
12
+ - llama
13
+ - audio
14
+ - speech
15
+ - india
16
+ datasets:
17
+ - proprietary
18
+ pipeline_tag: text-to-speech
19
+ co2_eq_emissions:
20
+ emissions: 0
21
+ source: "Not specified"
22
+ training_type: "unknown"
23
+ geographical_location: "unknown"
24
+ ---
25
+
26
+ # <span style="color: #7FFF7F;">Veena GGUF Models</span>
27
+
28
+
29
+ ## <span style="color: #7F7FFF;">Model Generation Details</span>
30
+
31
+ This model was generated using [llama.cpp](https://github.com/ggerganov/llama.cpp) at commit [`8846aace`](https://github.com/ggerganov/llama.cpp/commit/8846aace4934ad29651ea61b8c7e3f6b0556e3d2).
32
+
33
+
34
+
35
+
36
+
37
+ ---
38
+
39
+ ## <span style="color: #7FFF7F;">Quantization Beyond the IMatrix</span>
40
+
41
+ I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides.
42
+
43
+ In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here:
44
+ 👉 [Layer bumping with llama.cpp](https://github.com/Mungert69/GGUFModelBuilder/blob/main/model-converter/tensor_list_builder.py)
45
+
46
+ While this does increase model file size, it significantly improves precision for a given quantization level.
47
+
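As a rough illustration of the layer-bumping idea described above, the sketch below assembles a `llama-quantize` command line with per-tensor `--tensor-type` overrides. The tensor names, quantization types, and file names are placeholders chosen for illustration. This is not the logic in `tensor_list_builder.py`, and the exact override syntax your build accepts should be checked with `llama-quantize --help`.

```python
from typing import Dict, List, Optional


def build_quantize_command(
    src: str,
    dst: str,
    base_type: str,
    bumps: Dict[str, str],
    imatrix: Optional[str] = None,
) -> List[str]:
    """Build an argv list for llama-quantize with per-tensor overrides.

    `bumps` maps a tensor name pattern to the higher-precision type it
    should be quantized to (both are hypothetical values in this sketch).
    """
    cmd = ["llama-quantize"]
    if imatrix:
        cmd += ["--imatrix", imatrix]
    for tensor, qtype in bumps.items():
        # One --tensor-type flag per bumped tensor, in NAME=TYPE form.
        cmd += ["--tensor-type", f"{tensor}={qtype}"]
    cmd += [src, dst, base_type]
    return cmd


# Hypothetical example: quantize to q4_k_m but keep two tensor groups at q6_k.
command = build_quantize_command(
    "veena-bf16.gguf",
    "veena-q4_k_m.gguf",
    "q4_k_m",
    {"attn_v": "q6_k", "ffn_down": "q6_k"},
    imatrix="veena.imatrix",
)
print(" ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` once the paths point at real files.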
48
+ ### **I'd love your feedback—have you tried this? How does it perform for you?**
49
+
50
+
51
+
52
+
53
+ ---
54
+
55
+ <a href="https://readyforquantum.com/huggingface_gguf_selection_guide.html" style="color: #7FFF7F;">
56
+ Click here to get info on choosing the right GGUF model format
57
+ </a>
58
+
59
+ ---
60
+
61
+
62
+
63
+ <!--Begin Original Model Card-->
64
+
65
+
66
+ # Veena - Text to Speech for Indian Languages
67
+
68
+ Veena is a state-of-the-art neural text-to-speech (TTS) model specifically designed for Indian languages, developed by Maya Research. Built on a Llama architecture backbone, Veena generates natural, expressive speech in Hindi and English with remarkable quality and ultra-low latency.
69
+
70
+ ## Model Overview
71
+
72
+ **Veena** is a 3B parameter autoregressive transformer model based on the Llama architecture. It is designed to synthesize high-quality speech from text in Hindi and English, including code-mixed scenarios. The model outputs audio at a 24kHz sampling rate using the SNAC neural codec.
73
+
74
+ * **Model type:** Autoregressive Transformer
75
+ * **Base Architecture:** Llama (3B parameters)
76
+ * **Languages:** Hindi, English
77
+ * **Audio Codec:** SNAC @ 24kHz
78
+ * **License:** Apache 2.0
79
+ * **Developed by:** Maya Research
80
+ * **Model URL:** [https://huggingface.co/maya-research/veena](https://huggingface.co/maya-research/veena)
81
+
82
+ ## Key Features
83
+
84
+ * **4 Distinct Voices:** `kavya`, `agastya`, `maitri`, and `vinaya` - each with unique vocal characteristics.
85
+ * **Multilingual Support:** Native Hindi and English capabilities with code-mixed support.
86
+ * **Ultra-Fast Inference:** Sub-80ms latency on H100-80GB GPUs.
87
+ * **High-Quality Audio:** 24kHz output with the SNAC neural codec.
88
+ * **Production-Ready:** Optimized for real-world deployment with 4-bit quantization support.
89
+
90
+ ## How to Get Started with the Model
91
+
92
+ ### Installation
93
+
94
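+ To use Veena, you need to install the `transformers`, `torch`, `torchaudio`, `snac`, and `bitsandbytes` libraries, plus `accelerate` (required by `device_map="auto"`) and `soundfile` (used below to write WAV files).
95
+
96
+ ```bash
97
+ pip install transformers torch torchaudio accelerate
98
+ pip install snac bitsandbytes soundfile # For audio decoding, quantization, and WAV output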
99
+ ```
100
+
101
+ ### Basic Usage
102
+
103
+ The following Python code demonstrates how to generate speech from text using Veena with 4-bit quantization for efficient inference.
104
+
105
+ ```python
106
+ import torch
107
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
108
+ from snac import SNAC
109
+ import soundfile as sf
110
+
111
+ # Model configuration for 4-bit inference
112
+ quantization_config = BitsAndBytesConfig(
113
+     load_in_4bit=True,
114
+     bnb_4bit_quant_type="nf4",
115
+     bnb_4bit_compute_dtype=torch.bfloat16,
116
+     bnb_4bit_use_double_quant=True,
117
+ )
118
+
119
+ # Load model and tokenizer
120
+ model = AutoModelForCausalLM.from_pretrained(
121
+     "maya-research/veena-tts",
122
+     quantization_config=quantization_config,
123
+     device_map="auto",
124
+     trust_remote_code=True,
125
+ )
126
+ tokenizer = AutoTokenizer.from_pretrained("maya-research/veena-tts", trust_remote_code=True)
127
+
128
+ # Initialize SNAC decoder (this example assumes a CUDA GPU is available)
129
+ snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().cuda()
130
+
131
+ # Control token IDs (fixed for Veena)
132
+ START_OF_SPEECH_TOKEN = 128257
133
+ END_OF_SPEECH_TOKEN = 128258
134
+ START_OF_HUMAN_TOKEN = 128259
135
+ END_OF_HUMAN_TOKEN = 128260
136
+ START_OF_AI_TOKEN = 128261
137
+ END_OF_AI_TOKEN = 128262
138
+ AUDIO_CODE_BASE_OFFSET = 128266
139
+
140
+ # Available speakers
141
+ speakers = ["kavya", "agastya", "maitri", "vinaya"]
142
+
143
+ def generate_speech(text, speaker="kavya", temperature=0.4, top_p=0.9):
144
+     """Generate speech from text using specified speaker voice"""
145
+
146
+     # Prepare input with speaker token
147
+     prompt = f"<spk_{speaker}> {text}"
148
+     prompt_tokens = tokenizer.encode(prompt, add_special_tokens=False)
149
+
150
+     # Construct full sequence: [HUMAN] <spk_speaker> text [/HUMAN] [AI] [SPEECH]
151
+     input_tokens = [
152
+         START_OF_HUMAN_TOKEN,
153
+         *prompt_tokens,
154
+         END_OF_HUMAN_TOKEN,
155
+         START_OF_AI_TOKEN,
156
+         START_OF_SPEECH_TOKEN
157
+     ]
158
+
159
+     input_ids = torch.tensor([input_tokens], device=model.device)
160
+
161
+     # Token budget from text length: 7 SNAC tokens per frame, capped at 700 new tokens
162
+     max_tokens = min(int(len(text) * 1.3) * 7 + 21, 700)
163
+
164
+     # Generate audio tokens
165
+     with torch.no_grad():
166
+         output = model.generate(
167
+             input_ids,
168
+             max_new_tokens=max_tokens,
169
+             do_sample=True,
170
+             temperature=temperature,
171
+             top_p=top_p,
172
+             repetition_penalty=1.05,
173
+             pad_token_id=tokenizer.pad_token_id,
174
+             eos_token_id=[END_OF_SPEECH_TOKEN, END_OF_AI_TOKEN]
175
+         )
176
+
177
+     # Extract SNAC tokens (drop the prompt and any control tokens)
178
+     generated_ids = output[0][len(input_tokens):].tolist()
179
+     snac_tokens = [
180
+         token_id for token_id in generated_ids
181
+         if AUDIO_CODE_BASE_OFFSET <= token_id < (AUDIO_CODE_BASE_OFFSET + 7 * 4096)
182
+     ]
183
+
184
+     if not snac_tokens:
185
+         raise ValueError("No audio tokens generated")
186
+
187
+     # Decode audio
188
+     audio = decode_snac_tokens(snac_tokens, snac_model)
189
+     return audio
190
+
191
+ def decode_snac_tokens(snac_tokens, snac_model):
192
+     """De-interleave and decode SNAC tokens to audio"""
193
+     if len(snac_tokens) < 7 or len(snac_tokens) % 7 != 0:
194
+         raise ValueError("SNAC token count must be a non-zero multiple of 7")  # fail loudly instead of returning None
195
+
196
+     # De-interleave tokens into 3 hierarchical levels
197
+     codes_lvl = [[] for _ in range(3)]
198
+     llm_codebook_offsets = [AUDIO_CODE_BASE_OFFSET + i * 4096 for i in range(7)]
199
+
200
+     for i in range(0, len(snac_tokens), 7):
201
+         # Level 0: Coarse (1 token)
202
+         codes_lvl[0].append(snac_tokens[i] - llm_codebook_offsets[0])
203
+         # Level 1: Medium (2 tokens)
204
+         codes_lvl[1].append(snac_tokens[i+1] - llm_codebook_offsets[1])
205
+         codes_lvl[1].append(snac_tokens[i+4] - llm_codebook_offsets[4])
206
+         # Level 2: Fine (4 tokens)
207
+         codes_lvl[2].append(snac_tokens[i+2] - llm_codebook_offsets[2])
208
+         codes_lvl[2].append(snac_tokens[i+3] - llm_codebook_offsets[3])
209
+         codes_lvl[2].append(snac_tokens[i+5] - llm_codebook_offsets[5])
210
+         codes_lvl[2].append(snac_tokens[i+6] - llm_codebook_offsets[6])
211
+
212
+     # Convert to tensors for SNAC decoder
213
+     hierarchical_codes = []
214
+     for lvl_codes in codes_lvl:
215
+         tensor = torch.tensor(lvl_codes, dtype=torch.int32, device=next(snac_model.parameters()).device).unsqueeze(0)
216
+         if torch.any((tensor < 0) | (tensor > 4095)):
217
+             raise ValueError("Invalid SNAC token values")
218
+         hierarchical_codes.append(tensor)
219
+
220
+     # Decode with SNAC
221
+     with torch.no_grad():
222
+         audio_hat = snac_model.decode(hierarchical_codes)
223
+
224
+     return audio_hat.squeeze().clamp(-1, 1).cpu().numpy()
225
+
226
+ # --- Example Usage ---
227
+
228
+ # Hindi
229
+ text_hindi = "आज मैंने एक नई तकनीक के बारे में सीखा जो कृत्रिम बुद्धिमत्ता का उपयोग करके मानव जैसी आवाज़ उत्पन्न कर सकती है।"
230
+ audio = generate_speech(text_hindi, speaker="kavya")
231
+ sf.write("output_hindi_kavya.wav", audio, 24000)
232
+
233
+ # English
234
+ text_english = "Today I learned about a new technology that uses artificial intelligence to generate human-like voices."
235
+ audio = generate_speech(text_english, speaker="agastya")
236
+ sf.write("output_english_agastya.wav", audio, 24000)
237
+
238
+ # Code-mixed
239
+ text_mixed = "मैं तो पूरा presentation prepare कर चुका हूं! कल रात को ही मैंने पूरा code base चेक किया।"
240
+ audio = generate_speech(text_mixed, speaker="maitri")
241
+ sf.write("output_mixed_maitri.wav", audio, 24000)
242
+ ```
243
+
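The de-interleaving step in `decode_snac_tokens` can be checked without a GPU or the model. The sketch below repeats the same slot-to-level mapping in plain Python, using the token offset and codebook size from the Basic Usage code. The `LEVEL_OF_SLOT` table and the `deinterleave` helper are names introduced here for illustration only.

```python
from typing import List

AUDIO_CODE_BASE_OFFSET = 128266  # first audio token ID, as in Basic Usage
CODEBOOK_SIZE = 4096             # codes per slot
FRAME_SIZE = 7                   # LLM tokens per SNAC frame

# Slot position inside a 7-token frame -> SNAC level (0 coarse, 1 medium, 2 fine).
LEVEL_OF_SLOT = [0, 1, 2, 2, 1, 2, 2]


def deinterleave(tokens: List[int]) -> List[List[int]]:
    """Split a flat stream of audio token IDs into three per-level code lists."""
    if len(tokens) == 0 or len(tokens) % FRAME_SIZE != 0:
        raise ValueError("token count must be a non-zero multiple of 7")
    levels: List[List[int]] = [[], [], []]
    for index, token in enumerate(tokens):
        slot = index % FRAME_SIZE
        # Each slot has its own 4096-wide ID range; subtract it to get the raw code.
        code = token - (AUDIO_CODE_BASE_OFFSET + slot * CODEBOOK_SIZE)
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"token {token} is out of range for slot {slot}")
        levels[LEVEL_OF_SLOT[slot]].append(code)
    return levels


# One synthetic frame whose raw codes are 10..16, one per slot.
frame = [AUDIO_CODE_BASE_OFFSET + s * CODEBOOK_SIZE + 10 + s for s in range(FRAME_SIZE)]
print(deinterleave(frame))
```

Each frame contributes one coarse code, two medium codes (slots 1 and 4), and four fine codes (slots 2, 3, 5, 6), matching the order used by the decoder function above.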
244
+ ## Uses
245
+
246
+ Veena is ideal for a wide range of applications requiring high-quality, low-latency speech synthesis for Indian languages, including:
247
+
248
+ * **Accessibility:** Screen readers and voice-enabled assistance for visually impaired users.
249
+ * **Customer Service:** IVR systems, voice bots, and automated announcements.
250
+ * **Content Creation:** Dubbing for videos, e-learning materials, and audiobooks.
251
+ * **Automotive:** In-car navigation and infotainment systems.
252
+ * **Edge Devices:** Voice-enabled smart devices and IoT applications.
253
+
254
+ ## Technical Specifications
255
+
256
+ ### Architecture
257
+
258
+ Veena leverages a 3B parameter transformer-based architecture with several key innovations:
259
+
260
+ * **Base Architecture:** Llama-style autoregressive transformer (3B parameters)
261
+ * **Audio Codec:** SNAC (24kHz) for high-quality audio token generation
262
+ * **Speaker Conditioning:** Special speaker tokens (`<spk_kavya>`, `<spk_agastya>`, `<spk_maitri>`, `<spk_vinaya>`)
263
+ * **Parameter-Efficient Training:** LoRA adaptation with differentiated ranks for attention and FFN modules.
264
+ * **Context Length:** 2048 tokens
265
+
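To make the speaker-conditioning layout concrete, here is a minimal sketch of how the prompt token sequence is assembled, using the control token IDs from the Basic Usage code. The `fake_encode` function is a stand-in for the real tokenizer and is purely an assumption of this example, as is the `build_input_tokens` helper name.

```python
from typing import Callable, List

START_OF_SPEECH_TOKEN = 128257
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
SPEAKERS = ("kavya", "agastya", "maitri", "vinaya")


def build_input_tokens(text: str, speaker: str, encode: Callable[[str], List[int]]) -> List[int]:
    """Return [HUMAN] <spk_x> text [/HUMAN] [AI] [SPEECH] as token IDs."""
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker {speaker!r}; expected one of {SPEAKERS}")
    prompt_tokens = encode(f"<spk_{speaker}> {text}")
    return [
        START_OF_HUMAN_TOKEN,
        *prompt_tokens,
        END_OF_HUMAN_TOKEN,
        START_OF_AI_TOKEN,
        START_OF_SPEECH_TOKEN,
    ]


def fake_encode(prompt: str) -> List[int]:
    """Hypothetical tokenizer stand-in: one ID per character."""
    return [ord(ch) for ch in prompt]


tokens = build_input_tokens("hi", "kavya", fake_encode)
print(len(tokens), tokens[0], tokens[-2:])
```

With the real model, `encode` would be something like `lambda s: tokenizer.encode(s, add_special_tokens=False)`, as in the Basic Usage code.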
266
+ ### Training
267
+
268
+ #### Training Infrastructure
269
+
270
+ * **Hardware:** 8× NVIDIA H100 80GB GPUs
271
+ * **Distributed Training:** DDP with optimized communication
272
+ * **Precision:** BF16 mixed precision training with gradient checkpointing
273
+ * **Memory Optimization:** 4-bit quantization with NF4 + double quantization
274
+
275
+ #### Training Configuration
276
+
277
+ * **LoRA Configuration:**
278
+     * `lora_rank_attention`: 192
279
+     * `lora_rank_ffn`: 96
280
+     * `lora_alpha`: 2× rank (384 for attention, 192 for FFN)
281
+     * `lora_dropout`: 0.05
282
+     * `target_modules`: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
283
+     * `modules_to_save`: `["embed_tokens"]`
284
+ * **Optimizer Configuration:**
285
+     * `optimizer`: AdamW (8-bit)
286
+     * `optimizer_betas`: (0.9, 0.98)
287
+     * `optimizer_eps`: 1e-5
288
+     * `learning_rate_peak`: 1e-4
289
+     * `lr_scheduler`: cosine
290
+     * `warmup_ratio`: 0.02
291
+ * **Batch Configuration:**
292
+     * `micro_batch_size`: 8
293
+     * `gradient_accumulation_steps`: 4
294
+     * `effective_batch_size`: 256
295
+
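The batch and LoRA numbers above fit together with simple arithmetic. The sketch below assumes standard data-parallel training, where the effective batch size is the per-GPU micro-batch times the accumulation steps times the number of GPUs, and the conventional LoRA scaling factor of alpha divided by rank. Both are assumptions about how the listed values combine, not statements taken from the training code.

```python
# Values copied from the training configuration above.
micro_batch_size = 8
gradient_accumulation_steps = 4
num_gpus = 8  # 8x H100, from the training infrastructure list

# Assumed data-parallel formula: every GPU processes its own micro-batches.
effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)

# LoRA: alpha is stated as 2x rank for both module groups.
lora_ranks = {"attention": 192, "ffn": 96}
lora_alphas = {name: 2 * rank for name, rank in lora_ranks.items()}

# Conventional LoRA scaling applied to the low-rank update (assumed, not confirmed).
lora_scales = {name: lora_alphas[name] / lora_ranks[name] for name in lora_ranks}
print(lora_alphas, lora_scales)
```

Under these assumptions the product reproduces the listed effective batch size of 256, and both module groups end up with the same scaling factor despite their different ranks.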
296
+ #### Training Data
297
+
298
+ Veena was trained on **proprietary, high-quality datasets** specifically curated for Indian language TTS.
299
+
300
+ * **Data Volume:** 15,000+ utterances per speaker (60,000+ total)
301
+ * **Languages:** Native Hindi and English utterances with code-mixed support
302
+ * **Speaker Diversity:** 4 professional voice artists with distinct characteristics
303
+ * **Audio Quality:** Studio-grade recordings at 24kHz sampling rate
304
+ * **Content Diversity:** Conversational, narrative, expressive, and informational styles
305
+
306
+ **Note:** The training datasets are proprietary and not publicly available.
307
+
308
+ ## Performance Benchmarks
309
+
310
+ | Metric | Value |
311
+ | --------------------- | ------------------------- |
312
+ | Latency (H100-80GB) | \<80ms |
313
+ | Latency (A100-40GB) | \~120ms |
314
+ | Latency (RTX 4090) | \~200ms |
315
+ | Real-time Factor | 0.05x |
316
+ | Throughput | \~170k tokens/s (8×H100) |
317
+ | Audio Quality (MOS) | 4.2/5.0 |
318
+ | Speaker Similarity | 92% |
319
+ | Intelligibility | 98% |
320
+
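As a worked reading of the real-time factor row, the sketch below assumes the common definition RTF = synthesis time / audio duration. The table does not state which definition or hardware the 0.05x figure refers to, so treat the numbers as illustrative.

```python
def synthesis_seconds(audio_seconds: float, rtf: float = 0.05) -> float:
    """Estimated compute time for a clip, assuming RTF = synthesis time / audio duration."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


# Under this assumption, 10 s of speech takes about 0.5 s to synthesize.
estimate = synthesis_seconds(10.0)
print(f"{estimate:.2f} s")
```

Note that this is a throughput measure over a whole clip and is separate from the latency rows in the table.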
321
+ ## Risks, Limitations and Biases
322
+
323
+ * **Language Support:** Currently supports only Hindi and English. Performance on other Indian languages is not guaranteed.
324
+ * **Speaker Diversity:** Limited to 4 speaker voices, which may not represent the full diversity of Indian accents and dialects.
325
+ * **Hardware Requirements:** Requires a GPU for real-time or near-real-time inference. CPU performance will be significantly slower.
326
+ * **Input Length:** The model is limited to a maximum input length of 2048 tokens.
327
+ * **Bias:** The model's performance and voice characteristics are a reflection of the proprietary training data. It may exhibit biases present in the data.
328
+
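Besides the 2048-token context limit, the Basic Usage code caps generation at 700 new tokens through `min(int(len(text) * 1.3) * 7 + 21, 700)`, so longer inputs may be cut off before the speech ends. The sketch below reproduces that budget and shows one possible workaround: splitting text into sentence-sized chunks and synthesizing them one by one. The `split_into_chunks` helper and its 75-character default are assumptions of this example, not part of Veena.

```python
import re
from typing import List

MAX_NEW_TOKENS_CAP = 700


def max_new_tokens_for(text: str) -> int:
    """Token budget formula copied from the Basic Usage code."""
    return min(int(len(text) * 1.3) * 7 + 21, MAX_NEW_TOKENS_CAP)


# Split after ., !, ? or the Devanagari danda when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?।])\s+")


def split_into_chunks(text: str, max_chars: int = 75) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(max_new_tokens_for("x" * 75), max_new_tokens_for("x" * 76))
print(split_into_chunks("A b. C d! E f?", max_chars=9))
```

With this formula, 75 characters is the longest input whose budget is not clipped by the cap. Each chunk could then be passed to `generate_speech` and the resulting arrays concatenated.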
329
+ ## Future Updates
330
+
331
+ We are actively working on expanding Veena's capabilities:
332
+
333
+ * Support for Tamil, Telugu, Bengali, Marathi, and other Indian languages.
334
+ * Additional speaker voices with regional accents.
335
+ * Emotion and prosody control tokens.
336
+ * Streaming inference support.
337
+ * CPU optimization for edge deployment.
338
+
339
+ ## Citing
340
+
341
+ If you use Veena in your research or applications, please cite:
342
+
343
+ ```bibtex
344
+ @misc{veena2025,
345
+ title={Veena: Open Source Text-to-Speech for Indian Languages},
346
+ author={Maya Research Team},
347
+ year={2025},
348
+ publisher={HuggingFace},
349
+ url={https://huggingface.co/maya-research/veena-tts}
350
+ }
351
+ ```
352
+
353
+ ## Acknowledgments
354
+
355
+ We thank the open-source community and all contributors who made this project possible. Special thanks to the voice artists who provided high-quality recordings for training.
356
+
357
+ <!--End Original Model Card-->
358
+
359
+ ---
360
+
361
+ # <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
362
+
363
+ Help me test my **AI-Powered Quantum Network Monitor Assistant** with **quantum-ready security checks**:
364
+
365
+ 👉 [Quantum Network Monitor](https://readyforquantum.com/?assistant=open&utm_source=huggingface&utm_medium=referral&utm_campaign=huggingface_repo_readme)
366
+
367
+
368
+ The full open-source code for the Quantum Network Monitor service is available in my GitHub repos (those with NetworkMonitor in the name): [Source Code Quantum Network Monitor](https://github.com/Mungert69). You will also find the code I use to quantize the models, if you want to do it yourself, at [GGUFModelBuilder](https://github.com/Mungert69/GGUFModelBuilder).
369
+
370
+ 💬 **How to test**:
371
+ Choose an **AI assistant type**:
372
+ - `TurboLLM` (GPT-4.1-mini)
373
+ - `HugLLM` (Hugging Face open-source models)
374
+ - `TestLLM` (Experimental CPU-only)
375
+
376
+ ### **What I’m Testing**
377
+ I’m pushing the limits of **small open-source models for AI network monitoring**, specifically:
378
+ - **Function calling** against live network services
379
+ - **How small can a model go** while still handling:
380
+     - Automated **Nmap security scans**
381
+     - **Quantum-readiness checks**
382
+     - **Network Monitoring tasks**
383
+
384
+ 🟡 **TestLLM** – Current experimental model (llama.cpp on 2 CPU threads in a Hugging Face Docker Space):
385
+ - ✅ **Zero-configuration setup**
386
+ - ⏳ 30s load time (slow inference but **no API costs**). No token limit, as the cost is low.
387
+ - 🔧 **Help wanted!** If you’re into **edge-device AI**, let’s collaborate!
388
+
389
+ ### **Other Assistants**
390
+ 🟢 **TurboLLM** – Uses **gpt-4.1-mini** :
391
+ - It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
392
+ - **Create custom cmd processors to run .net code on Quantum Network Monitor Agents**
393
+ - **Real-time network diagnostics and monitoring**
394
+ - **Security Audits**
395
+ - **Penetration testing** (Nmap/Metasploit)
396
+
397
+ 🔵 **HugLLM** – Latest Open-source models:
398
+ - 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.
399
+
400
+ ### 💡 **Example commands you could test**:
401
+ 1. `"Give me info on my website's SSL certificate"`
402
+ 2. `"Check if my server is using quantum-safe encryption for communication"`
403
+ 3. `"Run a comprehensive security audit on my server"`
404
+ 4. `"Create a cmd processor to .. (whatever you want)"` Note: you need to install a [Quantum Network Monitor Agent](https://readyforquantum.com/Download/?utm_source=huggingface&utm_medium=referral&utm_campaign=huggingface_repo_readme) to run the .NET code on. This is a very flexible and powerful feature. Use with caution!
405
+
406
+ ### Final Word
407
+
408
+ I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is [open source](https://github.com/Mungert69). Feel free to use whatever you find helpful.
409
+
410
+ If you appreciate the work, please consider [buying me a coffee](https://www.buymeacoffee.com/mahadeva) ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
411
+
412
+ I'm also open to job opportunities or sponsorship.
413
+
414
+ Thank you! 😊