Add model files

Files changed (6)

README.md +160 -3
config.json +108 -0
tokenizer.py +952 -0
xtts-v2.safetensors +3 -0
xtts2_config.py +228 -0
xtts2_modeling.py +1070 -0

README.md CHANGED

@@ -1,3 +1,160 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+base_model:
+- coqui/XTTS-v2
+---
+# Auralis 🌌
+## Model Details 🛠️
+**Model Name:** Auralis
+**Model Architecture:** Based on [Coqui XTTS-v2](https://huggingface.co/coqui/XTTS-v2)
+**License:**
+- license: Apache 2.0
+- base_model: XTTS-v2 Components [Coqui AI License](https://coqui.ai/cpml)
+**Language Support:** English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (Simplified), Hungarian, Korean, Japanese, Hindi
+**Developed by:** [AstraMind.ai](https://www.astramind.ai)
+**GitHub:** [AstraMind AI](https://github.com/astramind-ai/Auralis/tree/main)
+**Primary Use Case:** Text-to-Speech (TTS) generation for real-world applications, including books, dialogues, and multilingual tasks.
+---
+## Model Description 🚀
+Auralis transforms text into natural, high-quality speech with exceptional speed and scalability. It is powered by [Coqui XTTS-v2](https://huggingface.co/coqui/XTTS-v2) and optimized for both consumer-grade and high-performance GPUs. Auralis is designed to meet real-world needs like long-text processing, voice cloning, and concurrent request handling.
+### Key Features:
+- **Warp-Speed Processing:** Generate speech for an entire novel (e.g., Harry Potter) in ~10 minutes.
+- **Hardware Friendly:** Requires <10GB VRAM on a single NVIDIA RTX 3090.
+- **Scalable:** Handles multiple requests simultaneously.
+- **Streaming:** Seamlessly processes long texts in a streaming format.
+- **Custom Voices:** Enables voice cloning from short reference audio.
+---
+## Quick Start ⭐
+```python
+from auralis import TTS, TTSRequest
+# Initialize the model
+tts = TTS().from_pretrained("AstraMindAI/xtts2-gpt")
+# Create a TTS request
+request = TTSRequest(
+    text="Hello Earth! This is Auralis speaking.",
+    speaker_files=["reference.wav"]
+)
+# Generate speech
+output = tts.generate_speech(request)
+output.save("output.wav")
+```
+---
+## Ebook Generation 📚
+Auralis converts ebooks into audio at lightning speed. For a complete Python script, see [ebook_audio_generator.py](https://github.com/astramind-ai/Auralis/blob/main/examples/vocalize_a_ebook.py).
+```python
+from auralis import AudioPreprocessingConfig  # TTS, TTSRequest and `tts` come from Quick Start
+def process_book(chapter_file: str, speaker_file: str):
+    with open(chapter_file, 'r') as f:  # Read the chapter text
+        chapter = f.read()
+    # You can pass the whole book; Auralis takes care of splitting it
+    request = TTSRequest(
+        text=chapter,
+        speaker_files=[speaker_file],
+        audio_config=AudioPreprocessingConfig(
+            enhance_speech=True,
+            normalize=True
+        )
+    )
+    output = tts.generate_speech(request)
+    output.play()
+    output.save("chapter_output.wav")
+# Example usage
+process_book("chapter1.txt", "reference_voice.wav")
+```
+---
+## Intended Use 🌟
+Auralis is designed for:
+- **Content Creators:** Generate audiobooks, podcasts, or voiceovers.
+- **Developers:** Integrate TTS into applications via a simple Python API.
+- **Accessibility:** Provide audio versions of digital content for people with visual or reading difficulties.
+- **Multilingual Scenarios:** Convert text to speech in multiple supported languages.
+---
+## Performance 📊
+**Benchmarks on NVIDIA RTX 3090:**
+- Short phrases (<100 characters): ~1 second
+- Medium texts (<1,000 characters): ~5-10 seconds
+- Full books (~100,000 characters): ~10 minutes
+**Memory Usage:**
+- Base VRAM: ~4GB
+- Peak VRAM: ~10GB
+---
+## Model Features 🛸
+1. **Speed & Efficiency:**
+   - Smart batching for rapid processing of long texts.
+   - Memory-optimized for consumer GPUs.
+2. **Easy Integration:**
+   - Python API with support for synchronous and asynchronous workflows.
+   - Streaming mode for continuous playback during generation.
+3. **Audio Quality Enhancements:**
+   - Background noise reduction.
+   - Voice clarity and volume normalization.
+   - Customizable audio preprocessing.
+4. **Multilingual Support:**
+   - Automatic language detection.
+   - High-quality speech in 15+ languages.
+5. **Customization:**
+   - Voice cloning using short reference clips.
+   - Adjustable parameters for tone, pacing, and language.
+---
+## Limitations & Ethical Considerations ⚠️
+- **Voice Cloning Risks:** Auralis supports voice cloning, which may raise ethical concerns about misuse. Use responsibly and ensure proper consent.
+- **Accent Limitations:** Although Auralis is robust across many languages, accent and intonation may vary depending on the input.
+---
+## Citation 📜
+If you use Auralis in your research or projects, please cite:
+```bibtex
+@misc{auralis2024,
+  author = {AstraMind AI},
+  title = {Auralis: High-Performance Text-to-Speech Engine},
+  year = {2024},
+  url = {https://huggingface.co/AstraMindAI/auralis}
+}
+```
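
As a rough reading of the RTX 3090 benchmark figures in the README's Performance section, the sketch below converts them into characters-per-second throughput. The helper name `chars_per_second` and the `BENCHMARKS` table are hypothetical, and the inputs are the README's approximate upper-bound figures, not new measurements.

```python
# Approximate figures quoted in the README's Performance section (RTX 3090).
# Each entry: (label, characters, seconds). These are the README's rough
# numbers, taken at the upper end of each stated range.
BENCHMARKS = [
    ("short phrase", 100, 1.0),
    ("medium text", 1_000, 10.0),   # upper end of the ~5-10 second range
    ("full book", 100_000, 600.0),  # ~10 minutes
]


def chars_per_second(characters: int, seconds: float) -> float:
    """Hypothetical helper: average throughput in characters per second."""
    return characters / seconds


for label, characters, seconds in BENCHMARKS:
    rate = chars_per_second(characters, seconds)
    print(f"{label}: ~{rate:.0f} chars/s")
```

On these figures, short and medium inputs come out at roughly 100 characters per second and the full-book case at roughly 167, keeping in mind that the README's numbers are approximate.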

config.json ADDED

@@ -0,0 +1,108 @@

+{
+  "model_type": "xtts",
+  "architectures": [
+    "XttsGPT"
+  ],
+  "audio_config": {
+    "fmax": 8000,
+    "fmin": 0,
+    "hop_length": 256,
+    "mel_channels": 80,
+    "mel_norms_file": null,
+    "n_fft": 1024,
+    "output_sample_rate": 24000,
+    "power": 1.0,
+    "sample_rate": 22050,
+    "win_length": 1024
+  },
+  "d_vector_dim": 512,
+  "decoder_input_dim": 1024,
+  "num_chars": 255,
+  "duration_const": 102400,
+  "output_hop_length": 256,
+  "input_sample_rate": 22050,
+  "output_sample_rate": 24000,
+  "gpt": {
+    "model_type": "xtts_gpt"
+  },
+  "gpt_config": {
+    "model_type": "xtts_gpt",
+    "architectures": [
+      "XttsGPT"
+    ],
+    "vocab_size": 7544,
+    "hidden_size": 1024,
+    "num_hidden_layers": 30,
+    "num_attention_heads": 16,
+    "n_inner": 4096,
+    "number_text_tokens": 7544,
+    "num_audio_tokens": 1026,
+    "max_audio_tokens": 605,
+    "start_audio_token": 1024,
+    "stop_audio_token": 1025,
+    "max_text_tokens": 402,
+    "max_prompt_tokens": 70,
+    "activation_function": "gelu_new",
+    "attn_pdrop": 0.1,
+    "layer_norm_epsilon": 1e-05,
+    "initializer_range": 0.02,
+    "use_masking_gt_prompt_approach": true,
+    "use_perceiver_resampler": true,
+    "kv_cache": true,
+    "enable_redaction": false,
+    "reorder_and_upcast_attn": false,
+    "scale_attn_by_inverse_layer_idx": false,
+    "auto_map": {
+      "AutoConfig": "AstraMindAI/xtts2-gpt--gpt_config.XTTSGPTConfig",
+      "AutoModelForCausalLM": "AstraMindAI/xtts2-gpt--xtts2_gpt_modeling.XttsGPT",
+      "AutoTokenizer": "AstraMindAI/xtts2-gpt--tokenizer.XTTSTokenizerFast"
+    },
+    "languages": [
+      "en",
+      "es",
+      "fr",
+      "de",
+      "it",
+      "pt",
+      "pl",
+      "tr",
+      "ru",
+      "nl",
+      "cs",
+      "ar",
+      "zh-cn",
+      "hu",
+      "ko",
+      "ja",
+      "vi"
+    ]
+  },
+  "gpt_code_stride_len": 1024,
+  "cond_d_vector_in_each_upsampling_layer": true,
+  "auto_map": {
+    "AutoConfig": "AstraMindAI/xtts2--xtts2_config.XTTSConfig",
+    "AutoModelForCausalLM": "AstraMindAI/xtts2--xtts2_modeling.Xtts",
+    "AutoTokenizer": "AstraMindAI/xtts2--tokenizer.XTTSTokenizerFast"
+  },
+  "languages": [
+    "en",
+    "es",
+    "fr",
+    "de",
+    "it",
+    "pt",
+    "pl",
+    "tr",
+    "ru",
+    "nl",
+    "cs",
+    "ar",
+    "zh-cn",
+    "hu",
+    "ko",
+    "ja",
+    "vi"
+  ],
+  "tokenizer_file": "",
+  "transformers_version": "4.46.0"
+}
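
The audio-related values in config.json imply some timing figures that are easy to miss in the raw JSON. The sketch below derives them. The constant names are illustrative, and reading `gpt_code_stride_len` as the number of input samples covered by one GPT audio token is an assumption about XTTS-style models, not something this commit states.

```python
# Values copied from config.json above.
INPUT_SAMPLE_RATE = 22050    # "input_sample_rate"
HOP_LENGTH = 256             # "audio_config.hop_length"
GPT_CODE_STRIDE_LEN = 1024   # "gpt_code_stride_len"
MAX_AUDIO_TOKENS = 605       # "gpt_config.max_audio_tokens"

# Mel-spectrogram frames produced per second of input audio.
mel_frames_per_second = INPUT_SAMPLE_RATE / HOP_LENGTH

# ASSUMPTION: one GPT audio token spans GPT_CODE_STRIDE_LEN input samples.
seconds_per_audio_token = GPT_CODE_STRIDE_LEN / INPUT_SAMPLE_RATE

# Upper bound on audio length for a single generation pass under that assumption.
max_seconds_per_generation = MAX_AUDIO_TOKENS * seconds_per_audio_token

print(f"mel frames per second: {mel_frames_per_second:.2f}")
print(f"milliseconds per audio token: {seconds_per_audio_token * 1000:.1f}")
print(f"max seconds per generation: {max_seconds_per_generation:.1f}")
```

Under that assumption, a single generation pass is capped at about 28 seconds of audio, which would be consistent with long texts being split into chunks before synthesis.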

tokenizer.py ADDED
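
The scoring rule used by `find_best_split_point` in this file can be illustrated in isolation. The sketch below applies the same formula (marker priority multiplied by a distance score) but with a reduced, hypothetical marker list `DEMO_MARKERS` and a hypothetical function name `demo_split_point`. It is a simplified illustration, not the file's implementation.

```python
import re

# Reduced marker list for illustration only; the list in tokenizer.py is longer
# and covers multiple scripts.
DEMO_MARKERS = [
    (r"[.!?]+\s*", 1.0),  # sentence-ending punctuation (strong break)
    (r",\s*", 0.8),       # commas (medium break)
    (r"\s+", 0.5),        # any whitespace (last resort)
]


def demo_split_point(text: str, target_pos: int, window_size: int = 30) -> int:
    """Return the best cut position within +/- window_size of target_pos."""
    start = max(0, target_pos - window_size)
    end = min(len(text), target_pos + window_size)
    window = text[start:end]

    best_pos, best_score = target_pos, 0.0
    for pattern, priority in DEMO_MARKERS:
        for match in re.finditer(pattern, window):
            pos = start + match.end()          # cut right after the marker
            distance = abs(pos - target_pos)
            distance_score = 1 - distance / (window_size * 2)
            score = priority * distance_score  # same combination as the diff
            if score > best_score:
                best_pos, best_score = pos, score
    return best_pos


text = "The quick brown fox jumps. Then it rests, briefly, under the old oak tree."
cut = demo_split_point(text, target_pos=30, window_size=10)
print(repr(text[:cut]))
print(repr(text[cut:]))
```

Because a match can be at most `window_size` characters from the target, the distance score never drops below 0.5. A priority-1.0 marker therefore always scores at least 0.5, which is the most a plain whitespace match can reach, so sentence-ending punctuation inside the window is never beaten by whitespace alone. A nearby comma can still outscore a distant period.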

@@ -0,0 +1,952 @@

+import re
+from typing import List, Optional, Union, Dict, Any
+from functools import cached_property
+import pypinyin
+import torch
+from hangul_romanize import Transliter
+from hangul_romanize.rule import academic
+from num2words import num2words
+from spacy.lang.ar import Arabic
+from spacy.lang.en import English
+from spacy.lang.es import Spanish
+from spacy.lang.ja import Japanese
+from spacy.lang.zh import Chinese
+from spacy.lang.vi import Vietnamese
+from transformers import PreTrainedTokenizerFast, BatchEncoding
+from transformers.tokenization_utils_base import TruncationStrategy, PaddingStrategy
+from tokenizers import Tokenizer
+from tokenizers.pre_tokenizers import WhitespaceSplit
+from tokenizers.processors import TemplateProcessing
+from auralis.models.xttsv2.components.tts.layers.xtts.zh_num2words import TextNorm as zh_num2words
+import cutlet
+def get_spacy_lang(lang):
+    if lang == "zh":
+        return Chinese()
+    elif lang == "ja":
+        return Japanese()
+    elif lang == "ar":
+        return Arabic()
+    elif lang == "es":
+        return Spanish()
+    elif lang == "vi":
+        return Vietnamese()
+    else:
+        # For most languages, English does the job
+        return English()
+def find_best_split_point(text: str, target_pos: int, window_size: int = 30) -> int:
+    """
+    Find the best split point near the target position, considering punctuation and language markers.
+    Added for better sentence splitting in TTS.
+    """
+    # Define split markers by priority
+    markers = [
+        # Strong breaks (longest pause)
+        (r'[.!?؟။။။]+[\s]*', 1.0),  # Periods, exclamation, question (multi-script)
+        (r'[\n\r]+\s*[\n\r]+', 1.0),  # Multiple newlines
+        (r'[:|;；：；][\s]*', 0.9),  # Colons, semicolons (multi-script)
+        # Medium breaks
+        (r'[,，،、][\s]*', 0.8),  # Commas (multi-script)
+        (r'[)}\]）】』»›》\s]+', 0.7),  # Closing brackets/parentheses
+        (r'[-—−]+[\s]*', 0.7),  # Dashes
+        # Weak breaks
+        (r'\s+[&+=/\s]+\s+', 0.6),  # Special characters with spaces
+        (r'[\s]+', 0.5),  # Any whitespace as last resort
+    ]
+    # Calculate window boundaries
+    start = max(0, target_pos - window_size)
+    end = min(len(text), target_pos + window_size)
+    window = text[start:end]
+    best_pos = target_pos
+    best_score = 0
+    for pattern, priority in markers:
+        matches = list(re.finditer(pattern, window))
+        for match in matches:
+            # Calculate position score based on distance from target
+            pos = start + match.end()
+            distance = abs(pos - target_pos)
+            distance_score = 1 - (distance / (window_size * 2))
+            # Combine priority and position scores
+            score = priority * distance_score
+            if score > best_score:
+                best_score = score
+                best_pos = pos
+    return best_pos
+def split_sentence(text: str, lang: str, text_split_length: int = 250) -> List[str]:
+    """
+    Enhanced sentence splitting with language awareness and optimal breakpoints.
+    Args:
+        text: Input text to split
+        lang: Language code
+        text_split_length: Target length for splits
+    Returns:
+        List of text splits optimized for TTS
+    """
+    text = text.strip()
+    if len(text) <= text_split_length:
+        return [text]
+    nlp = get_spacy_lang(lang)
+    if "sentencizer" not in nlp.pipe_names:
+        nlp.add_pipe("sentencizer")
+    # Get base sentences using spaCy
+    doc = nlp(text)
+    sentences = list(doc.sents)
+    splits = []
+    current_split = []
+    current_length = 0
+    for sent in sentences:
+        sentence_text = str(sent).strip()
+        sentence_length = len(sentence_text)
+        # If sentence fits in current split
+        if current_length + sentence_length <= text_split_length:
+            current_split.append(sentence_text)
+            current_length += sentence_length + 1
+        # Handle long sentences
+        elif sentence_length > text_split_length:
+            # Add current split if exists
+            if current_split:
+                splits.append(" ".join(current_split))
+                current_split = []
+                current_length = 0
+            # Split long sentence at optimal points
+            remaining = sentence_text
+            while len(remaining) > text_split_length:
+                split_pos = find_best_split_point(
+                    remaining,
+                    text_split_length,
+                    window_size=30
+                )
+                # Add split and continue with remainder
+                splits.append(remaining[:split_pos].strip())
+                remaining = remaining[split_pos:].strip()
+            # Handle remaining text
+            if remaining:
+                current_split = [remaining]
+                current_length = len(remaining)
+        # Start new split
+        else:
+            splits.append(" ".join(current_split))
+            current_split = [sentence_text]
+            current_length = sentence_length
+    # Add final split if needed
+    if current_split:
+        splits.append(" ".join(current_split))
+    # Clean up splits: drop empty entries and replace a trailing period with a space (prevents annoying sounds in Italian)
+    cleaned_sentences = [s[:-1] + ' ' if s.endswith('.') else s for s in splits if s]
+    return cleaned_sentences
+_whitespace_re = re.compile(r"\s+")
+# List of (regular expression, replacement) pairs for abbreviations:
+_abbreviations = {
+    "en": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("mrs", "misess"),
+            ("mr", "mister"),
+            ("dr", "doctor"),
+            ("st", "saint"),
+            ("co", "company"),
+            ("jr", "junior"),
+            ("maj", "major"),
+            ("gen", "general"),
+            ("drs", "doctors"),
+            ("rev", "reverend"),
+            ("lt", "lieutenant"),
+            ("hon", "honorable"),
+            ("sgt", "sergeant"),
+            ("capt", "captain"),
+            ("esq", "esquire"),
+            ("ltd", "limited"),
+            ("col", "colonel"),
+            ("ft", "fort"),
+        ]
+    ],
+    "es": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("sra", "señora"),
+            ("sr", "señor"),
+            ("dr", "doctor"),
+            ("dra", "doctora"),
+            ("st", "santo"),
+            ("co", "compañía"),
+            ("jr", "junior"),
+            ("ltd", "limitada"),
+        ]
+    ],
+    "fr": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("mme", "madame"),
+            ("mr", "monsieur"),
+            ("dr", "docteur"),
+            ("st", "saint"),
+            ("co", "compagnie"),
+            ("jr", "junior"),
+            ("ltd", "limitée"),
+        ]
+    ],
+    "de": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("fr", "frau"),
+            ("dr", "doktor"),
+            ("st", "sankt"),
+            ("co", "firma"),
+            ("jr", "junior"),
+        ]
+    ],
+    "pt": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("sra", "senhora"),
+            ("sr", "senhor"),
+            ("dr", "doutor"),
+            ("dra", "doutora"),
+            ("st", "santo"),
+            ("co", "companhia"),
+            ("jr", "júnior"),
+            ("ltd", "limitada"),
+        ]
+    ],
+    "it": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            # ("sig.ra", "signora"),
+            ("sig", "signore"),
+            ("dr", "dottore"),
+            ("st", "santo"),
+            ("co", "compagnia"),
+            ("jr", "junior"),
+            ("ltd", "limitata"),
+        ]
+    ],
+    "pl": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("p", "pani"),
+            ("m", "pan"),
+            ("dr", "doktor"),
+            ("sw", "święty"),
+            ("jr", "junior"),
+        ]
+    ],
+    "ar": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            # There are not many common abbreviations in Arabic as in English.
+        ]
+    ],
+    "zh": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            # Chinese doesn't typically use abbreviations in the same way as Latin-based scripts.
+        ]
+    ],
+    "cs": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("dr", "doktor"),  # doctor
+            ("ing", "inženýr"),  # engineer
+            ("p", "pan"),  # Could also map to pani for woman but no easy way to do it
+            # Other abbreviations would be specialized and not as common.
+        ]
+    ],
+    "ru": [
+        (re.compile("\\b%s\\b" % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("г-жа", "госпожа"),  # Mrs.
+            ("г-н", "господин"),  # Mr.
+            ("д-р", "доктор"),  # doctor
+            # Other abbreviations are less common or specialized.
+        ]
+    ],
+    "nl": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("dhr", "de heer"),  # Mr.
+            ("mevr", "mevrouw"),  # Mrs.
+            ("dr", "dokter"),  # doctor
+            ("jhr", "jonkheer"),  # young lord or nobleman
+            # Dutch uses more abbreviations, but these are the most common ones.
+        ]
+    ],
+    "tr": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("b", "bay"),  # Mr.
+            ("byk", "büyük"),  # büyük
+            ("dr", "doktor"),  # doctor
+            # Add other Turkish abbreviations here if needed.
+        ]
+    ],
+    "hu": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            ("dr", "doktor"),  # doctor
+            ("b", "bácsi"),  # Mr.
+            ("nőv", "nővér"),  # nurse
+            # Add other Hungarian abbreviations here if needed.
+        ]
+    ],
+    "ko": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            # Korean doesn't typically use abbreviations in the same way as Latin-based scripts.
+        ]
+    ],
+    "vi": [
+        (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+        for x in [
+            # Vietnamese doesn't typically use abbreviations in the same way as Latin-based scripts.
+        ]
+    ],
+}
+def expand_abbreviations_multilingual(text, lang="en"):
+    if lang in _abbreviations:
+        for regex, replacement in _abbreviations[lang]:
+            text = re.sub(regex, replacement, text)
+    return text
+_symbols_multilingual = {
+    "en": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " and "),
+            ("@", " at "),
+            ("%", " percent "),
+            ("#", " hash "),
+            ("$", " dollar "),
+            ("£", " pound "),
+            ("°", " degree "),
+        ]
+    ],
+    "es": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " y "),
+            ("@", " arroba "),
+            ("%", " por ciento "),
+            ("#", " numeral "),
+            ("$", " dolar "),
+            ("£", " libra "),
+            ("°", " grados "),
+        ]
+    ],
+    "fr": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " et "),
+            ("@", " arobase "),
+            ("%", " pour cent "),
+            ("#", " dièse "),
+            ("$", " dollar "),
+            ("£", " livre "),
+            ("°", " degrés "),
+        ]
+    ],
+    "de": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " und "),
+            ("@", " at "),
+            ("%", " prozent "),
+            ("#", " raute "),
+            ("$", " dollar "),
+            ("£", " pfund "),
+            ("°", " grad "),
+        ]
+    ],
+    "pt": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " e "),
+            ("@", " arroba "),
+            ("%", " por cento "),
+            ("#", " cardinal "),
+            ("$", " dólar "),
+            ("£", " libra "),
+            ("°", " graus "),
+        ]
+    ],
+    "it": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " e "),
+            ("@", " chiocciola "),
+            ("%", " per cento "),
+            ("#", " cancelletto "),
+            ("$", " dollaro "),
+            ("£", " sterlina "),
+            ("°", " gradi "),
+        ]
+    ],
+    "pl": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " i "),
+            ("@", " małpa "),
+            ("%", " procent "),
+            ("#", " krzyżyk "),
+            ("$", " dolar "),
+            ("£", " funt "),
+            ("°", " stopnie "),
+        ]
+    ],
+    "ar": [
+        # Arabic
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " و "),
+            ("@", " على "),
+            ("%", " في المئة "),
+            ("#", " رقم "),
+            ("$", " دولار "),
+            ("£", " جنيه "),
+            ("°", " درجة "),
+        ]
+    ],
+    "zh": [
+        # Chinese
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " 和 "),
+            ("@", " 在 "),
+            ("%", " 百分之 "),
+            ("#", " 号 "),
+            ("$", " 美元 "),
+            ("£", " 英镑 "),
+            ("°", " 度 "),
+        ]
+    ],
+    "cs": [
+        # Czech
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " a "),
+            ("@", " na "),
+            ("%", " procento "),
+            ("#", " křížek "),
+            ("$", " dolar "),
+            ("£", " libra "),
+            ("°", " stupně "),
+        ]
+    ],
+    "ru": [
+        # Russian
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " и "),
+            ("@", " собака "),
+            ("%", " процентов "),
+            ("#", " номер "),
+            ("$", " доллар "),
+            ("£", " фунт "),
+            ("°", " градус "),
+        ]
+    ],
+    "nl": [
+        # Dutch
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " en "),
+            ("@", " bij "),
+            ("%", " procent "),
+            ("#", " hekje "),
+            ("$", " dollar "),
+            ("£", " pond "),
+            ("°", " graden "),
+        ]
+    ],
+    "tr": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " ve "),
+            ("@", " at "),
+            ("%", " yüzde "),
+            ("#", " diyez "),
+            ("$", " dolar "),
+            ("£", " sterlin "),
+            ("°", " derece "),
+        ]
+    ],
+    "hu": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " és "),
+            ("@", " kukac "),
+            ("%", " százalék "),
+            ("#", " kettőskereszt "),
+            ("$", " dollár "),
+            ("£", " font "),
+            ("°", " fok "),
+        ]
+    ],
+    "ko": [
+        # Korean
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " 그리고 "),
+            ("@", " 에 "),
+            ("%", " 퍼센트 "),
+            ("#", " 번호 "),
+            ("$", " 달러 "),
+            ("£", " 파운드 "),
+            ("°", " 도 "),
+        ]
+    ],
+    "vi": [
+        (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+        for x in [
+            ("&", " và "),
+            ("@", " tại "),
+            ("%", " phần trăm "),
+            ("#", " thăng "),
+            ("$", " đô-la "),
+            ("£", " bảng "),
+            ("°", " độ "),
+        ]
+    ],
+}
+def expand_symbols_multilingual(text, lang="en"):
+    if lang in _symbols_multilingual:
+        for regex, replacement in _symbols_multilingual[lang]:
+            text = re.sub(regex, replacement, text)
+            text = text.replace("  ", " ")  # Ensure there are no double spaces
+    return text.strip()
+_ordinal_re = {
+    "en": re.compile(r"([0-9]+)(st|nd|rd|th)"),
+    "es": re.compile(r"([0-9]+)(º|ª|er|o|a|os|as)"),
+    "fr": re.compile(r"([0-9]+)(º|ª|er|re|e|ème)"),
+    "de": re.compile(r"([0-9]+)(st|nd|rd|th|º|ª|\.(?=\s|$))"),
+    "pt": re.compile(r"([0-9]+)(º|ª|o|a|os|as)"),
+    "it": re.compile(r"([0-9]+)(º|°|ª|o|a|i|e)"),
+    "pl": re.compile(r"([0-9]+)(º|ª|st|nd|rd|th)"),
+    "ar": re.compile(r"([0-9]+)(ون|ين|ث|ر|ى)"),
+    "cs": re.compile(r"([0-9]+)\.(?=\s|$)"),  # In Czech, a dot is often used after the number to indicate ordinals.
+    "ru": re.compile(r"([0-9]+)(-й|-я|-е|-ое|-ье|-го)"),
+    "nl": re.compile(r"([0-9]+)(de|ste|e)"),
+    "tr": re.compile(r"([0-9]+)(\.|inci|nci|uncu|üncü|\.)"),
+    "hu": re.compile(r"([0-9]+)(\.|adik|edik|odik|edik|ödik|ödike|ik)"),
+    "ko": re.compile(r"([0-9]+)(번째|번|차|째)"),
+    "vi": re.compile(r"(thứ) ([0-9]+)"),
+}
+_number_re = re.compile(r"[0-9]+")
+# noinspection Annotator
+_currency_re = {
+    "USD": re.compile(r"((\$[0-9\.\,]*[0-9]+)|([0-9\.\,]*[0-9]+\$))"),
+    "GBP": re.compile(r"((£[0-9\.\,]*[0-9]+)|([0-9\.\,]*[0-9]+£))"),
+    "EUR": re.compile(r"(([0-9\.\,]*[0-9]+€)|((€[0-9\.\,]*[0-9]+)))"),
+}
+_comma_number_re = re.compile(r"\b\d{1,3}(,\d{3})*(\.\d+)?\b")
+_dot_number_re = re.compile(r"\b\d{1,3}(\.\d{3})*(\,\d+)?\b")
+_decimal_number_re = re.compile(r"([0-9]+[.,][0-9]+)")
+def _remove_commas(m):
+    text = m.group(0)
+    if "," in text:
+        text = text.replace(",", "")
+    return text
+def _remove_dots(m):
+    text = m.group(0)
+    if "." in text:
+        text = text.replace(".", "")
+    return text
+def _expand_decimal_point(m, lang="en"):
+    amount = m.group(1).replace(",", ".")
+    return num2words(float(amount), lang=lang if lang != "cs" else "cz")
+def _expand_currency(m, lang="en", currency="USD"):
+    amount = float((re.sub(r"[^\d.]", "", m.group(0).replace(",", "."))))
+    full_amount = num2words(amount, to="currency", currency=currency, lang=lang if lang != "cs" else "cz")
+    and_equivalents = {
+        "en": ", ",
+        "es": " con ",
+        "fr": " et ",
+        "de": " und ",
+        "pt": " e ",
+        "it": " e ",
+        "pl": ", ",
+        "cs": ", ",
+        "ru": ", ",
+        "nl": ", ",
+        "ar": ", ",
+        "tr": ", ",
+        "hu": ", ",
+        "ko": ", ",
+        "vi": ", ",
+    }
+    if amount.is_integer():
+        last_and = full_amount.rfind(and_equivalents.get(lang, ", "))
+        if last_and != -1:
+            full_amount = full_amount[:last_and]
+    return full_amount
+def _expand_ordinal(m, lang="en"):
+    return num2words(int(m.group(1)), ordinal=True, lang=lang if lang != "cs" else "cz")
+def _expand_number(m, lang="en"):
+    return num2words(int(m.group(0)), lang=lang if lang != "cs" else "cz")
+def expand_numbers_multilingual(text, lang="en"):
+    if lang == "zh":
+        text = zh_num2words()(text)
+    else:
+        if lang in ["en", "ru"]:
+            text = re.sub(_comma_number_re, _remove_commas, text)
+        else:
+            text = re.sub(_dot_number_re, _remove_dots, text)
+        try:
+            text = re.sub(_currency_re["GBP"], lambda m: _expand_currency(m, lang, "GBP"), text)
+            text = re.sub(_currency_re["USD"], lambda m: _expand_currency(m, lang, "USD"), text)
+            text = re.sub(_currency_re["EUR"], lambda m: _expand_currency(m, lang, "EUR"), text)
+        except Exception:
+            pass  # Best effort: leave currency amounts unexpanded if expansion fails
+        if lang != "tr":
+            text = re.sub(_decimal_number_re, lambda m: _expand_decimal_point(m, lang), text)
+        if lang in _ordinal_re:
+            text = re.sub(_ordinal_re[lang], lambda m: _expand_ordinal(m, lang), text)
+        text = re.sub(_number_re, lambda m: _expand_number(m, lang), text)
+    return text
+def lowercase(text):
+    return text.lower()
+def collapse_whitespace(text):
+    return re.sub(_whitespace_re, " ", text)
+def multilingual_cleaners(text, lang):
+    text = text.replace('"', "")
+    if lang == "tr":
+        text = text.replace("İ", "i")
+        text = text.replace("Ö", "ö")
+        text = text.replace("Ü", "ü")
+    text = lowercase(text)
+    text = expand_numbers_multilingual(text, lang)
+    text = expand_abbreviations_multilingual(text, lang)
+    text = expand_symbols_multilingual(text, lang=lang)
+    text = collapse_whitespace(text)
+    return text
+def basic_cleaners(text):
+    """Basic pipeline that lowercases and collapses whitespace without transliteration."""
+    text = lowercase(text)
+    text = collapse_whitespace(text)
+    return text
+def chinese_transliterate(text):
+    return "".join(
+        [p[0] for p in pypinyin.pinyin(text, style=pypinyin.Style.TONE3, heteronym=False, neutral_tone_with_five=True)]
+    )
+def japanese_cleaners(text, katsu):
+    text = katsu.romaji(text)
+    text = lowercase(text)
+    return text
+def korean_transliterate(text, transliter):
+    return transliter.translit(text)
+# Fast Tokenizer Class
+class XTTSTokenizerFast(PreTrainedTokenizerFast):
+    """
+    Fast Tokenizer implementation for XTTS model using HuggingFace's PreTrainedTokenizerFast
+    """
+    def __init__(
+            self,
+            vocab_file: str = None,
+            tokenizer_object: Optional[Tokenizer] = None,
+            unk_token: str = "[UNK]",
+            pad_token: str = "[PAD]",
+            bos_token: str = "[START]",
+            eos_token: str = "[STOP]",
+            auto_map: dict = {"AutoTokenizer": ["AstraMindAI/xtts2-gpt--tokenizer.XTTSTokenizerFast", None]},
+            clean_up_tokenization_spaces: bool = True,
+            **kwargs
+    ):
+        if tokenizer_object is None and vocab_file is not None:
+            tokenizer_object = Tokenizer.from_file(vocab_file)
+        if tokenizer_object is not None:
+            # Configure the tokenizer
+            tokenizer_object.pre_tokenizer = WhitespaceSplit()
+            tokenizer_object.post_processor = TemplateProcessing(
+                single=f"{bos_token} $A {eos_token}",
+                special_tokens=[
+                    (bos_token, tokenizer_object.token_to_id(bos_token)),
+                    (eos_token, tokenizer_object.token_to_id(eos_token)),
+                ],
+            )
+        super().__init__(
+            tokenizer_object=tokenizer_object,
+            unk_token=unk_token,
+            pad_token=pad_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            **kwargs
+        )
+        # Character limits per language
+        self.char_limits = {
+            "en": 250, "de": 253, "fr": 273, "es": 239,
+            "it": 213, "pt": 203, "pl": 224, "zh": 82,
+            "ar": 166, "cs": 186, "ru": 182, "nl": 251,
+            "tr": 226, "ja": 71, "hu": 224, "ko": 95,
+            "vi": 200,
+        }
+        # Initialize language tools
+        self._katsu = None
+        self._korean_transliter = Transliter(academic)
+        # Ensure pad_token_id is set
+        if self.pad_token_id is None:
+            self.pad_token_id = self.tokenizer.token_to_id(self.pad_token)
+    @cached_property
+    def katsu(self):
+        if self._katsu is None:
+            self._katsu = cutlet.Cutlet()
+        return self._katsu
+    def preprocess_text(self, text: str, lang: str) -> str:
+        """Apply text preprocessing for language"""
+        base_lang = lang.split("-")[0]  # remove region
+        if base_lang in {"ar", "cs", "de", "en", "es", "fr", "hu", "it",
+                         "nl", "pl", "pt", "ru", "tr", "zh", "ko", "vi"}:
+            text = multilingual_cleaners(text, base_lang)
+            if base_lang == "zh":
+                text = chinese_transliterate(text)
+            if base_lang == "ko":
+                text = korean_transliterate(text, self._korean_transliter)
+        elif base_lang == "ja":
+            text = japanese_cleaners(text, self.katsu)
+        else:
+            text = basic_cleaners(text)
+        return text
+    def batch_encode_with_split(self, texts: Union[str, List[str]], lang: Union[str, List[str]],
+                                **kwargs) -> torch.Tensor:
+        """
+        Split texts into chunks that respect each language's character limit, then encode them
+        with the HuggingFace fast tokenizer, mirroring the behavior of the original XTTS-v2 tokenizer.
+        """
+        # Convert single inputs to lists
+        if isinstance(texts, str):
+            texts = [texts]
+        if isinstance(lang, str):
+            lang = [lang]
+        # Ensure lang list matches texts list
+        if len(lang) == 1 and len(texts) > 1:
+            lang = lang * len(texts)
+        # Check if texts and lang have the same length
+        if len(texts) != len(lang):
+            raise ValueError(f"Number of texts ({len(texts)}) does not match number of languages ({len(lang)}).")
+        chunk_list = []
+        chunk_langs = []
+        # For each text, split into chunks based on character limit
+        for text, text_lang in zip(texts, lang):
+            # Get language character limit
+            base_lang = text_lang.split("-")[0]
+            char_limit = self.char_limits.get(base_lang, 250)
+            # Cleaning and preprocessing happen later, in _batch_encode_plus
+            # Split text into sentences/chunks based on language
+            chunks = split_sentence(text, base_lang, text_split_length=char_limit)
+            # Accumulate chunks across all texts and keep one language entry per chunk
+            chunk_list.extend(chunks)
+            chunk_langs.extend([text_lang] * len(chunks))
+        # Ensure the tokenizer is a fast tokenizer
+        if not self.is_fast:
+            raise ValueError("The tokenizer must be a fast tokenizer.")
+        # Encode all chunks using the fast tokenizer
+        encoding: BatchEncoding = self(
+            chunk_list,
+            lang=chunk_langs,
+            add_special_tokens=False,
+            padding=False,
+            **kwargs
+        )
+        # With padding disabled, 'input_ids' holds one token-id sequence per chunk
+        return encoding['input_ids']
+    def _batch_encode_plus(
+            self,
+            batch_text_or_text_pairs,
+            add_special_tokens: bool = True,
+            padding_strategy=PaddingStrategy.DO_NOT_PAD,
+            truncation_strategy=TruncationStrategy.DO_NOT_TRUNCATE,
+            max_length: Optional[int] = None,
+            stride: int = 0,
+            is_split_into_words: bool = False,
+            pad_to_multiple_of: Optional[int] = None,
+            return_tensors: Optional[str] = None,
+            return_token_type_ids: Optional[bool] = None,
+            return_attention_mask: Optional[bool] = None,
+            return_overflowing_tokens: bool = False,
+            return_special_tokens_mask: bool = False,
+            return_offsets_mapping: bool = False,
+            return_length: bool = False,
+            verbose: bool = True,
+            **kwargs
+    ) -> Dict[str, Any]:
+        """
+        Override batch encoding to handle language-specific preprocessing
+        """
+        lang = kwargs.pop("lang", ["en"] * len(batch_text_or_text_pairs))
+        if isinstance(lang, str):
+            lang = [lang]
+        # Ensure lang list matches texts list
+        if len(lang) == 1 and len(batch_text_or_text_pairs) > 1:
+            lang = lang * len(batch_text_or_text_pairs)
+        # Check if batch_text_or_text_pairs and lang have the same length
+        if len(batch_text_or_text_pairs) != len(lang):
+            raise ValueError(f"Number of texts ({len(batch_text_or_text_pairs)}) does not match number of languages ({len(lang)}).")
+        # Preprocess each text in the batch with its corresponding language
+        processed_texts = []
+        for text, text_lang in zip(batch_text_or_text_pairs, lang):
+            if isinstance(text, str):
+                # Check length and preprocess
+                #self.check_input_length(text, text_lang)
+                processed_text = self.preprocess_text(text, text_lang)
+                # Format text with language tag and spaces
+                base_lang = text_lang.split("-")[0]
+                lang_code = "zh-cn" if base_lang == "zh" else base_lang
+                processed_text = f"[{lang_code}]{processed_text}"
+                processed_text = processed_text.replace(" ", "[SPACE]")
+                processed_texts.append(processed_text)
+            else:
+                processed_texts.append(text)
+        # Call the parent class's encoding method with processed texts
+        return super()._batch_encode_plus(
+            processed_texts,
+            add_special_tokens=add_special_tokens,
+            padding_strategy=padding_strategy,
+            truncation_strategy=truncation_strategy,
+            max_length=max_length,
+            stride=stride,
+            is_split_into_words=is_split_into_words,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_tensors=return_tensors,
+            return_token_type_ids=return_token_type_ids,
+            return_attention_mask=return_attention_mask,
+            return_overflowing_tokens=return_overflowing_tokens,
+            return_special_tokens_mask=return_special_tokens_mask,
+            return_offsets_mapping=return_offsets_mapping,
+            return_length=return_length,
+            verbose=verbose,
+            **kwargs
+        )
+    def __call__(
+            self,
+            text: Union[str, List[str]],
+            lang: Union[str, List[str]] = "en",
+            add_special_tokens: bool = True,
+            padding: Union[bool, str, PaddingStrategy] = False,
+            truncation: Union[bool, str, TruncationStrategy] = False,
+            max_length: Optional[int] = None,
+            stride: int = 0,
+            return_tensors: Optional[str] = None,
+            return_token_type_ids: Optional[bool] = None,
+            return_attention_mask: Optional[bool] = True,
+            **kwargs
+    ):
+        """
+        Main tokenization method
+        """
+        # Convert single string to list for batch processing
+        if isinstance(text, str):
+            text = [text]
+        if isinstance(lang, str):
+            lang = [lang]
+        # Ensure lang list matches texts list
+        if len(lang) == 1 and len(text) > 1:
+            lang = lang * len(text)
+        # Ensure text and lang lists have same length
+        if len(text) != len(lang):
+            raise ValueError(f"Number of texts ({len(text)}) does not match number of languages ({len(lang)}).")
+        # Convert padding strategy
+        if isinstance(padding, bool):
+            padding_strategy = PaddingStrategy.LONGEST if padding else PaddingStrategy.DO_NOT_PAD
+        else:
+            padding_strategy = PaddingStrategy(padding)
+        # Convert truncation strategy
+        if isinstance(truncation, bool):
+            truncation_strategy = TruncationStrategy.LONGEST_FIRST if truncation else TruncationStrategy.DO_NOT_TRUNCATE
+        else:
+            truncation_strategy = TruncationStrategy(truncation)
+        # Use the batch encoding method
+        encoded = self._batch_encode_plus(
+            text,
+            add_special_tokens=add_special_tokens,
+            padding_strategy=padding_strategy,
+            truncation_strategy=truncation_strategy,
+            max_length=max_length,
+            stride=stride,
+            return_tensors=return_tensors,
+            return_token_type_ids=return_token_type_ids,
+            return_attention_mask=return_attention_mask,
+            lang=lang,
+            **kwargs
+        )
+        return encoded
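
Review note: the per-text formatting applied in `_batch_encode_plus` (language tag prefix, then spaces replaced by `[SPACE]`) and the language broadcasting done in `__call__` can be reproduced without loading the tokenizer. The following is a minimal sketch, assuming the language-specific cleaners have already run on the text. `broadcast_langs` and `format_for_xtts` are hypothetical helper names used only for illustration; they do not exist in `tokenizer.py`.

```python
from typing import List, Union


def broadcast_langs(texts: List[str], lang: Union[str, List[str]]) -> List[str]:
    """Expand a single language code to one code per text (hypothetical helper)."""
    langs = [lang] if isinstance(lang, str) else list(lang)
    if len(langs) == 1 and len(texts) > 1:
        langs = langs * len(texts)
    if len(langs) != len(texts):
        raise ValueError(
            f"Number of texts ({len(texts)}) does not match number of languages ({len(langs)})."
        )
    return langs


def format_for_xtts(cleaned_text: str, lang: str) -> str:
    """Prefix the language tag and mark spaces, as the diff does before encoding."""
    base_lang = lang.split("-")[0]  # drop the region suffix, e.g. "en-us" -> "en"
    lang_code = "zh-cn" if base_lang == "zh" else base_lang
    return f"[{lang_code}]{cleaned_text}".replace(" ", "[SPACE]")


texts = ["hello world", "guten tag"]
formatted = [
    format_for_xtts(text, lang)
    for text, lang in zip(texts, broadcast_langs(texts, "en"))
]
print(formatted)
```

The sketch deliberately skips `preprocess_text`, so it only illustrates the string layout that reaches the underlying fast tokenizer, not the full cleaning pipeline.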

xtts-v2.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:130a9659aed2056d094e6d73f31474685d414f98747a61835a153038991e01ef
+size 352299952
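
Review note: the three lines above are a Git LFS pointer, not the weights themselves. A small sketch of how such a pointer can be read is shown below; `parse_lfs_pointer` is a hypothetical helper written for this note, and the pointer text is copied from the diff. The declared size works out to roughly 336 MiB.

```python
from typing import Dict

POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:130a9659aed2056d094e6d73f31474685d414f98747a61835a153038991e01ef
size 352299952
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields


pointer = parse_lfs_pointer(POINTER_TEXT)
hash_algo, _, digest = pointer["oid"].partition(":")
size_mib = int(pointer["size"]) / 1024 ** 2  # bytes to MiB
print(hash_algo, len(digest), f"{size_mib:.1f} MiB")
```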

xtts2_config.py ADDED Viewed

	@@ -0,0 +1,228 @@

+from dataclasses import asdict, dataclass
+from typing import Dict, Optional, List
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+@dataclass
+class GPTAudioConfig:
+    """Configuration for GPT audio processing parameters"""
+    mel_channels: int = 80
+    sample_rate: int = 22050
+    output_sample_rate: int = 24000
+@dataclass
+class XTTSAudioConfig:
+    """Configuration for audio processing parameters"""
+    sample_rate: int = 22050
+    output_sample_rate: int = 24000
+    mel_channels: int = 80
+    hop_length: int = 256
+    win_length: int = 1024
+    n_fft: int = 1024
+    fmin: int = 0
+    fmax: int = 8000
+    power: float = 1.0
+    mel_norms_file: Optional[str] = None
+class XTTSGPTConfig(PretrainedConfig):
+    """Configuration class for the GPT component of XTTS."""
+    model_type = "xtts_gpt"
+    def __init__(
+            self,
+            # Model architecture
+            hidden_size: int = 1024,  # gpt_n_model_channels in original
+            n_inner: int = 4096,
+            num_hidden_layers: int = 30,  # gpt_layers in original
+            num_attention_heads: int = 16,  # gpt_n_heads in original
+            # Tokenizer settings
+            vocab_size: int = 7544,  # gpt_number_text_tokens in original
+            number_text_tokens: int = 7544,  # Explicit text token vocabulary size
+            start_text_token: Optional[int] = None,
+            stop_text_token: Optional[int] = None,
+            # Audio token settings
+            num_audio_tokens: int = 1026,  # gpt_num_audio_tokens in original
+            start_audio_token: int = 1024,  # gpt_start_audio_token in original
+            stop_audio_token: int = 1025,  # gpt_stop_audio_token in original
+            # Sequence length settings
+            max_audio_tokens: int = 605,  # gpt_max_audio_tokens in original
+            max_text_tokens: int = 402,  # gpt_max_text_tokens in original
+            max_prompt_tokens: int = 70,  # gpt_max_prompt_tokens in original
+            gpt_max_audio_tokens: int = 605,  # Used for generation
+            # Model behavior settings
+            use_masking_gt_prompt_approach: bool = True,  # gpt_use_masking_gt_prompt_approach in original
+            use_perceiver_resampler: bool = True,  # gpt_use_perceiver_resampler in original
+            kv_cache: bool = True,
+            enable_redaction: bool = False,
+            # GPT batch settings
+            gpt_batch_size: int = 1,
+            # Audio processing
+            audio_config: Optional[Dict] = None,
+            # Architecture specifics
+            layer_norm_epsilon: float = 1e-5,
+            initializer_range: float = 0.02,
+            add_cross_attention: bool = False,
+            scale_attn_by_inverse_layer_idx: bool = False,
+            reorder_and_upcast_attn: bool = False,
+            # Size settings for the decoder
+            decoder_input_dim: int = 1024,
+            architectures=["XttsGPT"],
+            auto_map={
+                "AutoConfig": "AstraMindAI/xtts2-gpt--gpt_config.XTTSGPTConfig",
+                "AutoModelForCausalLM": "AstraMindAI/xtts2-gpt--xtts2_gpt_modeling.XttsGPT",
+            },
+            activation_function: str = "gelu",
+            attn_pdrop: float = 0.1,
+            **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.architectures = architectures
+        self.auto_map = auto_map
+        self.audio_config = GPTAudioConfig(
+            **audio_config if audio_config is not None else {}
+        )
+        self.activation_function = activation_function
+        self.attn_pdrop = attn_pdrop
+        self.hidden_size = hidden_size
+        self.n_inner = n_inner
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.vocab_size = vocab_size
+        self.number_text_tokens = number_text_tokens
+        self.start_text_token = start_text_token
+        self.stop_text_token = stop_text_token
+        self.num_audio_tokens = num_audio_tokens
+        self.start_audio_token = start_audio_token
+        self.stop_audio_token = stop_audio_token
+        self.max_audio_tokens = max_audio_tokens
+        self.max_text_tokens = max_text_tokens
+        self.max_prompt_tokens = max_prompt_tokens
+        self.gpt_max_audio_tokens = gpt_max_audio_tokens
+        self.use_masking_gt_prompt_approach = use_masking_gt_prompt_approach
+        self.use_perceiver_resampler = use_perceiver_resampler
+        self.kv_cache = kv_cache
+        self.enable_redaction = enable_redaction
+        self.gpt_batch_size = gpt_batch_size
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.initializer_range = initializer_range
+        self.add_cross_attention = add_cross_attention
+        self.scale_attn_by_inverse_layer_idx = scale_attn_by_inverse_layer_idx
+        self.reorder_and_upcast_attn = reorder_and_upcast_attn
+        self.decoder_input_dim = decoder_input_dim
+    def to_dict(self) -> Dict:
+        """Convert the config to a dictionary."""
+        output = super().to_dict()
+        output["audio_config"] = asdict(self.audio_config)
+        return output
+    @classmethod
+    def from_dict(cls, config_dict: Dict, *args, **kwargs) -> "XTTSGPTConfig":
+        """Create a config from a dictionary."""
+        return cls(**config_dict)
+class XTTSConfig(PretrainedConfig):
+    """Configuration class for XTTS model components except GPT."""
+    model_type = "xtts"
+    def __init__(
+            self,
+            # Audio settings
+            audio_config: Optional[Dict] = None,
+            input_sample_rate: int = 22050,
+            output_sample_rate: int = 24000,
+            output_hop_length: int = 256,
+            # Model architecture
+            decoder_input_dim: int = 1024,
+            d_vector_dim: int = 512,
+            cond_d_vector_in_each_upsampling_layer: bool = True,
+            # Training settings
+            gpt_code_stride_len: int = 1024,
+            duration_const: int = 102400,
+            # Tokenizer settings
+            tokenizer_file: str = "",
+            num_chars: int = 255,
+            # Language support
+            languages: Optional[List[str]] = None,
+            # GPT configuration
+            gpt_config: Optional[Dict] = None,
+            architectures=["Xtts"],
+            auto_map = {
+                       "AutoConfig": "AstraMindAI/xtts2--xtts2_config.XTTSConfig",
+                       "AutoModelForCausalLM": "AstraMindAI/xtts2--xtts2_modeling.Xtts",
+                   },
+            **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.architectures = architectures
+        self.auto_map = auto_map
+        # Initialize audio config
+        self.audio_config = XTTSAudioConfig(
+            **audio_config if audio_config is not None else {}
+        )
+        self.input_sample_rate = input_sample_rate
+        self.output_sample_rate = output_sample_rate
+        self.output_hop_length = output_hop_length
+        self.decoder_input_dim = decoder_input_dim
+        self.d_vector_dim = d_vector_dim
+        self.cond_d_vector_in_each_upsampling_layer = cond_d_vector_in_each_upsampling_layer
+        self.gpt_code_stride_len = gpt_code_stride_len
+        self.duration_const = duration_const
+        self.tokenizer_file = tokenizer_file
+        self.num_chars = num_chars
+        # Initialize GPT config
+        self.gpt = XTTSGPTConfig(**gpt_config if gpt_config is not None else {})
+        if languages is None:
+            self.languages = [
+                "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
+                "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi", "vi",
+            ]
+        else:
+            self.languages = languages
+    def to_dict(self) -> Dict:
+        """Convert the config to a dictionary."""
+        output = super().to_dict()
+        output["audio_config"] = asdict(self.audio_config)
+        output["gpt_config"] = self.gpt.to_dict()
+        return output
+    @classmethod
+    def from_dict(cls, config_dict: Dict, *args, **kwargs) -> "XTTSConfig":
+        """Create a config from a dictionary."""
+        if "gpt_config" in config_dict:
+            gpt_config = config_dict["gpt_config"]
+            config_dict = {k: v for k, v in config_dict.items() if k != "gpt_config"}
+            return cls(gpt_config=gpt_config, **config_dict)
+        return cls(**config_dict)
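
Review note: `XTTSConfig` stores its nested settings as objects but serializes them back to plain dicts in `to_dict`, and `from_dict` rebuilds them through the constructor. The sketch below illustrates that round trip with stand-in classes so it runs without `transformers`. `AudioSettings`, `InnerConfig`, and `OuterConfig` are hypothetical names that only mirror the shape of `XTTSAudioConfig`, `XTTSGPTConfig`, and `XTTSConfig`; they are not the classes from the diff.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class AudioSettings:  # stand-in for XTTSAudioConfig
    sample_rate: int = 22050
    output_sample_rate: int = 24000


class InnerConfig:  # stand-in for XTTSGPTConfig
    def __init__(self, max_text_tokens: int = 402, max_audio_tokens: int = 605):
        self.max_text_tokens = max_text_tokens
        self.max_audio_tokens = max_audio_tokens

    def to_dict(self) -> Dict[str, Any]:
        return dict(vars(self))


class OuterConfig:  # stand-in for XTTSConfig
    def __init__(self, audio_config: Optional[Dict] = None,
                 gpt_config: Optional[Dict] = None):
        # Nested settings arrive as plain dicts and are rebuilt into objects
        self.audio_config = AudioSettings(**(audio_config or {}))
        self.gpt = InnerConfig(**(gpt_config or {}))

    def to_dict(self) -> Dict[str, Any]:
        # Flatten the nested objects back into JSON-serializable dicts
        return {
            "audio_config": asdict(self.audio_config),
            "gpt_config": self.gpt.to_dict(),
        }

    @classmethod
    def from_dict(cls, config_dict: Dict[str, Any]) -> "OuterConfig":
        return cls(**config_dict)


original = OuterConfig(gpt_config={"max_text_tokens": 300})
restored = OuterConfig.from_dict(original.to_dict())
print(restored.to_dict() == original.to_dict())
```

The key point is that the dictionary keys emitted by `to_dict` (`audio_config`, `gpt_config`) match the constructor parameter names, which is what lets `from_dict` pass the dictionary straight back in.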

xtts2_modeling.py ADDED Viewed

	@@ -0,0 +1,1070 @@

+import asyncio
+import functools
+import logging
+import random
+import time
+import uuid
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Optional, List, Tuple, Union, AsyncGenerator, Dict, Any
+from concurrent.futures import ThreadPoolExecutor
+import librosa
+import torch
+import numpy as np
+import torchaudio
+import sounddevice as sd
+import io
+from torch import nn
+from IPython.display import Audio, display
+from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams, TokensPrompt, RequestOutput
+from vllm.multimodal import MultiModalDataDict
+from vllm.utils import Counter
+from TTS.tts.layers.xtts.hifigan_decoder import HifiDecoder
+from TTS.tts.layers.xtts.latent_encoder import ConditioningEncoder  # noqa
+from TTS.tts.layers.xtts.perceiver_encoder import PerceiverResampler  # noqa
+from .xtts2_config import XTTSConfig, XTTSGPTConfig
+from .tokenizer import XTTSTokenizerFast
+from ..xtts2_gpt.xtts2_gpt_modeling import LearnedPositionEmbeddings
+def wav_to_mel_cloning(
+        wav,
+        mel_norms_file="../experiments/clips_mel_norms.pth",
+        mel_norms=None,
+        device=torch.device("cpu"),
+        n_fft=4096,
+        hop_length=1024,
+        win_length=4096,
+        power=2,
+        normalized=False,
+        sample_rate=22050,
+        f_min=0,
+        f_max=8000,
+        n_mels=80,
+):
+    mel_stft = torchaudio.transforms.MelSpectrogram(
+        n_fft=n_fft,
+        hop_length=hop_length,
+        win_length=win_length,
+        power=power,
+        normalized=normalized,
+        sample_rate=sample_rate,
+        f_min=f_min,
+        f_max=f_max,
+        n_mels=n_mels,
+        norm="slaney",
+    ).to(device)
+    wav = wav.to(device)
+    mel = mel_stft(wav)
+    mel = torch.log(torch.clamp(mel, min=1e-5))
+    if mel_norms is None:
+        mel_norms = torch.load(mel_norms_file, map_location=device)
+    mel = mel / mel_norms.unsqueeze(0).unsqueeze(-1)
+    return mel
+def load_audio(audiopath, sampling_rate):
+    audio, lsr = torchaudio.load(audiopath)
+    # Stereo to mono if needed
+    if audio.size(0) != 1:
+        audio = torch.mean(audio, dim=0, keepdim=True)
+    if lsr != sampling_rate:
+        audio = torchaudio.functional.resample(audio, lsr, sampling_rate)
+    # Clip audio invalid values
+    audio.clip_(-1, 1)
+    return audio
+@dataclass
+class XTTSRequest:
+    """Container for XTTS inference request data"""
+    request_id: str
+    text: Union[AsyncGenerator[str, None], str]
+    language: str
+    speaker_file: str  # Path to the speaker audio file
+    generate_every_n_chars: Optional[int] = None
+    temperature: float = 0.75
+    top_p: float = 0.85
+    top_k: int = 50
+    repetition_penalty: float = 5.0
+    length_penalty: float = 1.0
+    do_sample: bool = True
+    max_ref_length: int = 60
+    gpt_cond_len: int = 30
+    gpt_cond_chunk_len: int = 4
+import threading
+class HiddenStatesCollector:
+    def __init__(self):
+        self.outputs = {}
+        self.lock = threading.Lock()
+    def __call__(self, outputs: Optional[torch.Tensor], request_id: str):
+        """Save outputs for a specific request"""
+        with self.lock:
+            if request_id not in self.outputs:
+                self.outputs[request_id] = []
+            self.outputs[request_id].append(outputs)
+    def get_hidden_states(self, request_id) -> Optional[torch.Tensor]:
+        with self.lock:
+            outputs = self.outputs.pop(request_id, None)
+        if outputs is not None:
+            outputs = torch.cat(outputs, dim=0)
+        return outputs
+    def bind_to_request(self, request_id: str):
+        def bound_collector(outputs: Optional[torch.Tensor], _request_id: str = None):
+            self(outputs, request_id)
+        return bound_collector
+class ExtendedSamplingParams(SamplingParams, kw_only=True):
+    """Extended sampling parameters that allows additional fields while maintaining compatibility with SamplingParams.
+    This class inherits from SamplingParams and allows adding new required fields
+    without conflicting with the base class's optional fields ordering.
+    """
+    hidden_state_collector: HiddenStatesCollector  # New required field
+class LogitsRepetitionPenalizer:
+    """A logits processor that applies repetition penalty to prevent repetitive text generation."""
+    def __init__(self, repetition_penalty: float):
+        if repetition_penalty < 0:
+            raise ValueError("Repetition penalty must be non-negative")
+        self.repetition_penalty = repetition_penalty
+    def __call__(self, token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
+        """Apply repetition penalty to the logits based on previous tokens."""
+        # If no repetition penalty or no tokens to check, return original logits
+        if self.repetition_penalty == 1.0 or not token_ids:
+            return logits
+        # Create a mask for the repeated tokens
+        repeated_tokens = torch.tensor(token_ids,
+                                       device=logits.device,
+                                       dtype=torch.long)
+        # Get logits of repeated tokens
+        repeated_logits = logits[repeated_tokens]
+        # Apply penalty: divide positive logits by penalty, multiply negative logits by penalty
+        repeated_logits = torch.where(
+            repeated_logits > 0,
+            repeated_logits / self.repetition_penalty,
+            repeated_logits * self.repetition_penalty
+        )
+        # Update only the logits for repeated tokens
+        logits[repeated_tokens] = repeated_logits
+        return logits
+@dataclass
+class XTTSOutput:
+    """Container for XTTS inference output with integrated audio utilities"""
+    request_id: str
+    wav: np.ndarray
+    sample_rate: int = 24000
+    def to_tensor(self) -> torch.Tensor:
+        """Convert numpy array to torch tensor"""
+        if isinstance(self.wav, np.ndarray):
+            return torch.from_numpy(self.wav)
+        return self.wav
+    def to_bytes(self, format: str = 'wav', sample_width: int = 2) -> bytes:
+        """Convert audio to bytes format.
+        Args:
+            format: Output format ('wav' or 'raw')
+            sample_width: Bit depth (1, 2, or 4 bytes per sample)
+        Returns:
+            Audio data as bytes
+        """
+        # Convert to tensor if needed
+        wav_tensor = self.to_tensor()
+        # Ensure correct shape (1, N) for torchaudio
+        if wav_tensor.dim() == 1:
+            wav_tensor = wav_tensor.unsqueeze(0)
+        # Clamp to [-1, 1] (values are clipped, not rescaled)
+        wav_tensor = torch.clamp(wav_tensor, -1.0, 1.0)
+        if format == 'wav':
+            buffer = io.BytesIO()
+            torchaudio.save(
+                buffer,
+                wav_tensor,
+                self.sample_rate,
+                format="wav",
+                # 8-bit WAV is unsigned PCM, 16-bit is signed PCM, 32-bit is written as float
+                encoding={1: "PCM_U", 2: "PCM_S"}.get(sample_width, "PCM_F"),
+                bits_per_sample=sample_width * 8
+            )
+            return buffer.getvalue()
+        elif format == 'raw':
+            # Scale to appropriate range based on sample width
+            if sample_width == 2:  # 16-bit
+                wav_tensor = (wav_tensor * 32767).to(torch.int16)
+            elif sample_width == 4:  # 32-bit
+                wav_tensor = (wav_tensor * 2147483647).to(torch.int32)
+            else:  # 8-bit
+                wav_tensor = (wav_tensor * 127).to(torch.int8)
+            return wav_tensor.cpu().numpy().tobytes()
+        else:
+            raise ValueError(f"Unsupported format: {format}")
+    def save(self,
+             filename: Union[str, Path],
+             sample_rate: Optional[int] = None,
+             format: Optional[str] = None) -> None:
+        """Save audio to file.
+        Args:
+            filename: Output filename
+            sample_rate: Optional new sample rate for resampling
+            format: Optional format override (default: inferred from extension)
+        """
+        wav_tensor = self.to_tensor()
+        if wav_tensor.dim() == 1:
+            wav_tensor = wav_tensor.unsqueeze(0)
+        # Resample if needed
+        if sample_rate and sample_rate != self.sample_rate:
+            wav_tensor = torchaudio.functional.resample(
+                wav_tensor,
+                orig_freq=self.sample_rate,
+                new_freq=sample_rate
+            )
+        else:
+            sample_rate = self.sample_rate
+        torchaudio.save(
+            filename,
+            wav_tensor,
+            sample_rate,
+            format=format
+        )
+    def resample(self, new_sample_rate: int) -> 'XTTSOutput':
+        """Create new XTTSOutput with resampled audio.
+        Args:
+            new_sample_rate: Target sample rate
+        Returns:
+            New XTTSOutput instance with resampled audio
+        """
+        wav_tensor = self.to_tensor()
+        if wav_tensor.dim() == 1:
+            wav_tensor = wav_tensor.unsqueeze(0)
+        resampled = torchaudio.functional.resample(
+            wav_tensor,
+            orig_freq=self.sample_rate,
+            new_freq=new_sample_rate
+        )
+        return XTTSOutput(
+            request_id=self.request_id,
+            wav=resampled.squeeze().numpy(),
+            sample_rate=new_sample_rate
+        )
+    def get_info(self) -> Tuple[int, int, float]:
+        """Get audio information.
+        Returns:
+            Tuple of (number of samples, sample rate, duration in seconds)
+        """
+        n_samples = len(self.wav)
+        duration = n_samples / self.sample_rate
+        return n_samples, self.sample_rate, duration
+    @classmethod
+    def from_tensor(cls, request_id: str, tensor: torch.Tensor, sample_rate: int = 24000) -> 'XTTSOutput':
+        """Create XTTSOutput from torch tensor.
+        Args:
+            request_id: Request identifier
+            tensor: Audio tensor
+            sample_rate: Sample rate of the audio
+        Returns:
+            New XTTSOutput instance
+        """
+        return cls(
+            request_id=request_id,
+            wav=tensor.squeeze().cpu().numpy(),
+            sample_rate=sample_rate
+        )
+    @classmethod
+    def from_file(cls, request_id: str, filename: Union[str, Path]) -> 'XTTSOutput':
+        """Create XTTSOutput from audio file.
+        Args:
+            request_id: Request identifier
+            filename: Path to audio file
+        Returns:
+            New XTTSOutput instance
+        """
+        wav_tensor, sample_rate = torchaudio.load(filename)
+        return cls.from_tensor(request_id, wav_tensor, sample_rate)
+    def play(self) -> None:
+        """Play the audio through the default sound device.
+        For use in regular Python scripts/applications."""
+        # Ensure the audio is in the correct format
+        if isinstance(self.wav, torch.Tensor):
+            audio_data = self.wav.cpu().numpy()
+        else:
+            audio_data = self.wav
+        # Ensure float32 and normalize
+        if audio_data.dtype != np.float32:
+            audio_data = audio_data.astype(np.float32)
+        audio_data = np.clip(audio_data, -1.0, 1.0)
+        # Play the audio
+        sd.play(audio_data, self.sample_rate)
+        sd.wait()  # Wait until the audio is finished playing
+    def display(self) -> Optional[Audio]:
+        """Display audio player in Jupyter notebook.
+        Returns Audio widget if in notebook, None otherwise."""
+        try:
+            # Convert to bytes
+            audio_bytes = self.to_bytes(format='wav')
+            # Create and display audio widget
+            audio_widget = Audio(audio_bytes, rate=self.sample_rate, autoplay=False)
+            display(audio_widget)
+            return audio_widget
+        except Exception as e:
+            print(f"Could not display audio widget: {str(e)}")
+            print("Try using .play() method instead")
+            return None
+    def preview(self) -> None:
+        """Smart play method that chooses appropriate playback method."""
+        try:
+            # Try notebook display first
+            if self.display() is None:
+                # Fall back to sounddevice if not in notebook
+                self.play()
+        except Exception as e:
+            print(f"Error playing audio: {str(e)}")
+class Xtts(nn.Module):
+    """Async XTTS model implementation using VLLM's AsyncEngine."""
+    def __init__(self, hifi_config: XTTSConfig, gpt_config: XTTSGPTConfig, tensor_parallel_size: int = 1, **kwargs):
+        super().__init__()
+        self.hifi_config = hifi_config
+        self.gpt_config = gpt_config
+        self.mel_bos_token_id = gpt_config.start_audio_token
+        self.mel_eos_token_id = gpt_config.stop_audio_token
+        self.tp = tensor_parallel_size
+        self.tokenizer = XTTSTokenizerFast.from_pretrained("AstraMindAI/xtts2-gpt")
+        self.request_counter = Counter()
+        self.executor = ThreadPoolExecutor(max_workers=4)  # For CPU-bound tasks
+        self.hidden_states_collector = HiddenStatesCollector()
+        # Register buffer before creating modules
+        self.register_buffer("mel_stats", torch.ones(80))
+        # Initialize all nn.Module components
+        self.conditioning_encoder = ConditioningEncoder(
+            gpt_config.audio_config.mel_channels,
+            gpt_config.hidden_size,
+            num_attn_heads=gpt_config.num_attention_heads
+        )
+        self.text_embedding = nn.Embedding(
+            gpt_config.number_text_tokens,
+            gpt_config.hidden_size
+        )
+        self.text_pos_embedding = (
+            LearnedPositionEmbeddings(
+                gpt_config.max_text_tokens + 2,
+                gpt_config.hidden_size,
+                supports_pp=False
+            )
+            if gpt_config.max_audio_tokens != -1
+            else functools.partial(gpt_config.null_position_embeddings, dim=gpt_config.hidden_size)
+        )
+        if gpt_config.use_perceiver_resampler:
+            self.conditioning_perceiver = PerceiverResampler(
+                dim=gpt_config.hidden_size,
+                depth=2,
+                dim_context=gpt_config.hidden_size,
+                num_latents=32,
+                dim_head=64,
+                heads=8,
+                ff_mult=4,
+                use_flash_attn=False,
+            )
+        # Initialize HiFi-GAN decoder
+        self.hifigan_decoder = HifiDecoder(
+            input_sample_rate=self.hifi_config.input_sample_rate,
+            output_sample_rate=self.hifi_config.output_sample_rate,
+            output_hop_length=self.hifi_config.output_hop_length,
+            ar_mel_length_compression=self.hifi_config.gpt_code_stride_len,
+            decoder_input_dim=self.hifi_config.decoder_input_dim,
+            d_vector_dim=self.hifi_config.d_vector_dim,
+            cond_d_vector_in_each_upsampling_layer=self.hifi_config.cond_d_vector_in_each_upsampling_layer,
+        )
+        # Kept for model loading purposes
+        self.text_head = nn.Linear(gpt_config.hidden_size, gpt_config.number_text_tokens, bias=True)
+        self.final_norm = nn.LayerNorm(gpt_config.hidden_size, eps=1e-5, bias=True)
+        # Initialize VLLM engine at the end
+        self.init_vllm_engine()
+        # Semaphore for concurrency control
+        self.max_concurrency = 10
+        self.semaphore = asyncio.BoundedSemaphore(self.max_concurrency)
+    def half(self):
+        # Downcasting is not permitted because it throws an error while padding.
+        # Return self (unchanged) so chained calls such as model.half().eval() keep working.
+        return self
+    def to(self, *args, **kwargs):
+        # Block downcasting
+        dtype = kwargs.get('dtype', None)
+        if dtype == torch.float16 or dtype == torch.bfloat16:
+            kwargs['dtype'] = torch.float32
+        elif len(args) > 0 and (args[0] == torch.float16 or args[0] == torch.bfloat16):
+            args = list(args)
+            args[0] = torch.float32
+            args = tuple(args)
+        return super().to(*args, **kwargs)
+    @property
+    def device(self):
+        """Get the current device of the model."""
+        return next(self.parameters()).device
+    @property
+    def dtype(self):
+        """Get the current dtype of the model."""
+        return next(self.parameters()).dtype
+    @staticmethod
+    def get_memory_percentage(memory: int) -> float:
+        """Return `memory` (in bytes) as a fraction of the GPU memory currently free on device 0."""
+        total_memory = torch.cuda.get_device_properties(0).total_memory
+        reserved_memory = torch.cuda.memory_reserved(0)
+        allocated_memory = torch.cuda.memory_allocated(0)
+        # Note: reserved memory already includes allocated memory, so this estimate is conservative
+        available_memory = total_memory - reserved_memory - allocated_memory
+        return memory / available_memory
+    def init_vllm_engine(self):
+        """Initialize the GPT component with vLLM's AsyncLLMEngine."""
+        engine_args = AsyncEngineArgs(
+            model="AstraMindAI/xtts2-gpt",
+            tensor_parallel_size=self.tp,
+            dtype="auto",
+            disable_log_stats=True,
+            max_model_len=self.gpt_config.max_text_tokens + self.gpt_config.max_audio_tokens,
+            gpu_memory_utilization=self.get_memory_percentage(3 * 1024 ** 3),
+            trust_remote_code=True,
+            enforce_eager=True,
+            limit_mm_per_prompt={"audio": 1},
+            max_num_batched_tokens=7296,
+        )
+        self.llm_engine = AsyncLLMEngine.from_engine_args(engine_args)
+    @classmethod
+    def from_pretrained(
+            cls,
+            pretrained_model_name_or_path: str,
+            torch_dtype: torch.dtype = torch.float32,
+            device_map: Optional[str] = "auto",
+            tensor_parallel_size: int = 1,
+            **kwargs,
+    ) -> "Xtts":
+        """Load a pretrained XTTS model from the Hugging Face Hub or from a local directory."""
+        from huggingface_hub import hf_hub_download
+        import json
+        import os
+        # Download and load configs
+        if not os.path.exists(pretrained_model_name_or_path):
+            config_file = hf_hub_download(
+                repo_id=pretrained_model_name_or_path,
+                filename="config.json"
+            )
+            with open(config_file, 'r') as f:
+                config = json.load(f)
+        else:
+            # Load from local path
+            with open(os.path.join(pretrained_model_name_or_path, "config.json"), 'r') as f:
+                config = json.load(f)
+        # Initialize configs
+        gpt_config = XTTSGPTConfig(**config['gpt_config'])
+        hifi_config = XTTSConfig(**config)
+        # Initialize model
+        model = cls(
+            hifi_config=hifi_config,
+            gpt_config=gpt_config,
+            tensor_parallel_size=tensor_parallel_size,
+            **kwargs
+        )
+        # Load model weights
+        if not os.path.exists(pretrained_model_name_or_path):
+            hifigan_weights = hf_hub_download(
+                repo_id=pretrained_model_name_or_path,
+                filename="xtts-v2.safetensors"
+            )
+        else:
+            hifigan_weights = os.path.join(pretrained_model_name_or_path, "xtts-v2.safetensors")
+        import safetensors.torch
+        # Load HiFi-GAN weights
+        hifigan_state = safetensors.torch.load_file(hifigan_weights)
+        model.load_state_dict(hifigan_state)
+        # Set model properties
+        model.config = config
+        # Cast model to specified dtype
+        model = model.to(torch_dtype)
+        model = model.to('cuda')
+        return model
+    @staticmethod
+    def load_audio(audio_path: Union[str, Path], sampling_rate: int = 22050) -> torch.Tensor:
+        """Load an audio file as a mono tensor resampled to `sampling_rate` and clipped to [-1, 1]."""
+        audio, lsr = torchaudio.load(audio_path)
+        # Stereo to mono if needed
+        if audio.size(0) != 1:
+            audio = torch.mean(audio, dim=0, keepdim=True)
+        if lsr != sampling_rate:
+            audio = torchaudio.functional.resample(audio, lsr, sampling_rate)
+        # Clip invalid audio values
+        audio.clip_(-1, 1)
+        return audio
+    @torch.inference_mode()
+    def get_speaker_embedding(self, audio, sr):
+        """Compute the L2-normalized speaker embedding from the audio, resampled to 16 kHz."""
+        audio_16k = torchaudio.functional.resample(audio, sr, 16000)
+        return (
+            self.hifigan_decoder.speaker_encoder.forward(audio_16k.to(self.device), l2_norm=True)
+            .unsqueeze(-1)
+            .to(self.device)
+        )
+    @torch.inference_mode()
+    def get_gpt_cond_latents(self, audio, sr, length: int = 30, chunk_length: int = 6):
+        """Compute the conditioning latents for the GPT model from the given audio."""
+        if sr != 22050:
+            audio = torchaudio.functional.resample(audio, sr, 22050)
+        if length > 0:
+            audio = audio[:, : 22050 * length]
+        if self.gpt_config.use_perceiver_resampler:
+            style_embs = []
+            for i in range(0, audio.shape[1], 22050 * chunk_length):
+                audio_chunk = audio[:, i: i + 22050 * chunk_length]
+                # if the chunk is too short ignore it
+                if audio_chunk.size(-1) < 22050 * 0.33:
+                    continue
+                mel_chunk = wav_to_mel_cloning(
+                    audio_chunk,
+                    mel_norms=self.mel_stats.cpu(),
+                    n_fft=2048,
+                    hop_length=256,
+                    win_length=1024,
+                    power=2,
+                    normalized=False,
+                    sample_rate=22050,
+                    f_min=0,
+                    f_max=8000,
+                    n_mels=80,
+                )
+                style_emb = self.get_style_emb(mel_chunk.to(self.device), None)
+                style_embs.append(style_emb)
+            # mean style embedding
+            cond_latent = torch.stack(style_embs).mean(dim=0)
+        else:
+            mel = wav_to_mel_cloning(
+                audio,
+                mel_norms=self.mel_stats.cpu(),
+                n_fft=4096,
+                hop_length=1024,
+                win_length=4096,
+                power=2,
+                normalized=False,
+                sample_rate=22050,
+                f_min=0,
+                f_max=8000,
+                n_mels=80,
+            )
+            cond_latent = self.get_style_emb(mel.to(self.device))
+        return cond_latent.transpose(1, 2)
+    @torch.inference_mode()
+    def get_conditioning_latents(
+            self,
+            audio_path,
+            max_ref_length=30,
+            gpt_cond_len=6,
+            gpt_cond_chunk_len=6,
+            librosa_trim_db=None,
+            sound_norm_refs=False,
+            load_sr=22050,
+    ):
+        """Get the conditioning latents for the GPT model from the given audio."""
+        # Deal with multiple references (a single path or a list of paths)
+        assert isinstance(audio_path, (str, Path, list)), "audio_path must be a string, a Path, or a list."
+        audio_paths = audio_path if isinstance(audio_path, list) else [audio_path]
+        speaker_embeddings = []
+        audios = []
+        for file_path in audio_paths:
+            audio = self.load_audio(file_path, load_sr)
+            audio = audio[:, : load_sr * max_ref_length].to(self.device).to(self.dtype)
+            if sound_norm_refs:
+                audio = (audio / torch.abs(audio).max()) * 0.75
+            if librosa_trim_db is not None:
+                audio = librosa.effects.trim(audio, top_db=librosa_trim_db)[0]
+            # Compute latents for the decoder
+            speaker_embedding = self.get_speaker_embedding(audio, load_sr)
+            speaker_embeddings.append(speaker_embedding)
+            audios.append(audio)
+        # Merge all the audios and compute the latents for the GPT
+        full_audio = torch.cat(audios, dim=-1)
+        gpt_cond_latents = self.get_gpt_cond_latents(
+            full_audio, load_sr, length=gpt_cond_len, chunk_length=gpt_cond_chunk_len
+        )  # [1, T, hidden_size] (get_gpt_cond_latents transposes before returning)
+        speaker_embedding = torch.stack(speaker_embeddings)
+        speaker_embedding = speaker_embedding.mean(dim=0)
+        return gpt_cond_latents, speaker_embedding
+    def get_style_emb(self, cond_input: torch.Tensor, return_latent: bool = False) -> torch.Tensor:
+        """Get conditioning embeddings from mel spectrograms."""
+        if not return_latent:
+            if cond_input.ndim == 4:
+                cond_input = cond_input.squeeze(1)
+            conds = self.conditioning_encoder(cond_input)
+            if hasattr(self, 'conditioning_perceiver'):
+                conds = self.conditioning_perceiver(
+                    conds.permute(0, 2, 1)
+                ).transpose(1, 2)
+        else:
+            conds = cond_input.unsqueeze(1)
+        return conds
+    async def prepare_text_tokens_async(self, text: str, language: str, split_text=False) \
+            -> Tuple[List[Union[int, List[int]]], List[torch.Tensor]]:
+        """Prepare text tokens for the given text and language."""
+        async def elaborate_tokens(text_tokens: List[int]) -> torch.Tensor:
+            text_tokens.insert(0, self.tokenizer.bos_token_id)
+            text_tokens.append(self.tokenizer.eos_token_id)
+            return torch.tensor(text_tokens).unsqueeze(0).to(self.text_embedding.weight.device)
+        async def embed_tokens(text_tokens: Union[torch.Tensor, List[torch.Tensor]]) -> List[torch.Tensor]:
+            embeds = []
+            if isinstance(text_tokens, list):
+                for list_element in text_tokens:
+                    embeds.append(self.text_embedding(list_element) + self.text_pos_embedding(list_element))
+            else:
+                embeds.append(self.text_embedding(text_tokens) + self.text_pos_embedding(text_tokens))
+            return embeds
+        fake_tokens_for_audio_generation = []
+        if split_text:
+            text_tokens = self.tokenizer.batch_encode_with_split(text, lang=[language])
+            for idx, text_token in enumerate(text_tokens):
+                text_tokens[idx] = await elaborate_tokens(text_token)
+                fake_tokens_for_audio_generation.append([1] * len(text_token))
+        else:
+            text_tokens = self.tokenizer.batch_encode(text, lang=[language])
+            text_tokens = await elaborate_tokens(text_tokens)
+            # text_tokens is now a [1, N] tensor, so count along the sequence dimension
+            fake_tokens_for_audio_generation = [1] * text_tokens.shape[-1]
+        return fake_tokens_for_audio_generation, await embed_tokens(text_tokens)
+    async def prepare_inputs_async(self, text: str, language: str, speaker_file: Union[str, Path],
+                                   max_ref_length: int, gpt_cond_len: int, gpt_cond_chunk_len: int, split_text: bool) \
+            -> Tuple[List[List[int]], List[torch.Tensor], torch.Tensor]:
+        """Prepare the input text with conditioning tokens and return the combined conditioning latents."""
+        # Tokenize text based on the language
+        text_tokens, text_embeddings = await self.prepare_text_tokens_async(text, language, split_text)
+        # Load the speaker file and convert it to a tensor
+        gpt_cond_latent, speaker_embeddings = await self.get_conditioning_latents_async(
+            speaker_file,
+            max_ref_length,
+            gpt_cond_len,
+            gpt_cond_chunk_len
+        )
+        cond_latents = []
+        for text_embedding in text_embeddings:
+            # Concatenate along sequence dimension
+            cond_latents.append((torch.cat([gpt_cond_latent, text_embedding], dim=1).squeeze(0)
+                                 .to(self.llm_engine.engine.model_config.dtype)))
+        return text_tokens, cond_latents, speaker_embeddings
+    async def get_conditioning_latents_async(
+            self,
+            audio_path,
+            max_ref_length=30,
+            gpt_cond_len=6,
+            gpt_cond_chunk_len=6,
+            librosa_trim_db=None,
+            sound_norm_refs=False,
+            load_sr=22050,
+    ):
+        """Async version of get_conditioning_latents with concurrency control."""
+        async with self.semaphore:
+            # Run the blocking get_conditioning_latents in the default executor
+            result = await asyncio.get_running_loop().run_in_executor(
+                None,
+                functools.partial(self.get_conditioning_latents,
+                                  audio_path,
+                                  max_ref_length,
+                                  gpt_cond_len,
+                                  gpt_cond_chunk_len,
+                                  librosa_trim_db,
+                                  sound_norm_refs,
+                                  load_sr)
+            )
+        return result
+    async def get_model_logits(self, token_ids: List[int], conditioning: MultiModalDataDict) -> torch.Tensor:
+        """Get the hidden states collected by the model for a specific request."""
+        request_id = uuid.uuid4().hex
+        # Add start and end tokens
+        token_ids = [self.mel_bos_token_id] + token_ids + [self.mel_eos_token_id] * 5
+        engine_inputs = TokensPrompt(prompt_token_ids=token_ids)
+        engine_inputs["multi_modal_data"] = conditioning
+        # Bind the collector to this request
+        bound_collector = self.hidden_states_collector.bind_to_request(request_id)
+        # Set up sampling parameters with the bound collector
+        sampling_params = ExtendedSamplingParams(
+            detokenize=False,
+            max_tokens=1,
+            hidden_state_collector=bound_collector,
+        )
+        # Generate with unique request ID
+        generator = self.llm_engine.generate(
+            prompt=engine_inputs,
+            sampling_params=sampling_params,
+            request_id=request_id
+        )
+        # Consume the generator with a timeout
+        async def consume_generator():
+            async for _ in generator:
+                pass
+        try:
+            await asyncio.wait_for(consume_generator(), timeout=300)
+        except asyncio.TimeoutError as e:
+            raise RuntimeError("Timeout while generating logits") from e
+        # Get the collected hidden states
+        hidden_states = self.hidden_states_collector.get_hidden_states(request_id)
+        if hidden_states is None:
+            raise RuntimeError(f"No hidden states collected for request {request_id}")
+        return hidden_states[-len(token_ids):, ...].unsqueeze(0).to(self.device).to(self.dtype)
+    async def process_tokens_to_speech(
+            self,
+            generators: List[AsyncGenerator[RequestOutput, None]],
+            speaker_embeddings: torch.Tensor,
+            multimodal_data: List[torch.Tensor],
+            chunk_size: int = 20,
+    ) -> AsyncGenerator[XTTSOutput, None]:
+        """
+        Process multiple token generators concurrently and emit results sequentially.
+        Uses a queue-based approach to handle multiple generators reliably.
+        """
+        # Create a queue for each generator to store its results
+        queues = [asyncio.Queue() for _ in generators]
+        # Create tasks for processing each generator
+        tasks = []
+        for i, generator in enumerate(generators):
+            task = asyncio.create_task(
+                self._process_single_generator(
+                    generator,
+                    queues[i],
+                    speaker_embeddings,
+                    multimodal_data[i],
+                    chunk_size
+                )
+            )
+            tasks.append(task)
+        try:
+            # Process queues in sequence
+            for i, queue in enumerate(queues):
+                while True:
+                    result = await queue.get()
+                    if result is None:
+                        # This generator has finished
+                        break
+                    else:
+                        yield result
+        finally:
+            # Ensure all tasks are properly cleaned up
+            for task in tasks:
+                if not task.done():
+                    task.cancel()
+            await asyncio.gather(*tasks, return_exceptions=True)
+    async def _process_single_generator(
+            self,
+            generator: AsyncGenerator[RequestOutput, None],
+            queue: asyncio.Queue,
+            speaker_embeddings: torch.Tensor,
+            gpt_embed_input: torch.Tensor,
+            chunk_size: int
+    ) -> None:
+        """Process a single generator and put results in its queue."""
+        try:
+            last_decoded_token = 0
+            accumulated_tokens = []
+            async for output in generator:
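+                # Get new tokens; track the absolute position in the cumulative output so the
+                # index stays valid even after accumulated_tokens is reset below
+                new_tokens = output.outputs[0].token_ids[last_decoded_token:]
+                accumulated_tokens.extend(new_tokens)
+                last_decoded_token += len(new_tokens)
+                # Process tokens only on the final output. The chunked condition
+                # (`or len(accumulated_tokens) >= chunk_size`) is left disabled: with it enabled,
+                # the same tokens were repeated. A likely cause was deriving last_decoded_token
+                # from len(accumulated_tokens), which restarts from zero after each reset.
+                if output.finished:
+                    # Process the accumulated tokens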
+                    hidden_states = await self.get_model_logits(
+                        accumulated_tokens,
+                        {
+                            "audio": {
+                                'embeds': gpt_embed_input,
+                                "is_logits_only_mode": True
+                            }
+                        }
+                    )
+                    # Generate audio segment
+                    wav = await asyncio.get_running_loop().run_in_executor(
+                        self.executor,
+                        lambda: self.hifigan_decoder.inference(
+                            hidden_states,
+                            g=speaker_embeddings
+                        ).cpu().numpy().squeeze()
+                    )
+                    # Put result in queue
+                    await queue.put(XTTSOutput(
+                        request_id=output.request_id,
+                        wav=wav
+                    ))
+                    # Reset accumulated tokens
+                    accumulated_tokens = []
+                if output.finished:
+                    break
+        except Exception as e:
+            # logging.exception records the traceback along with the message
+            logging.exception(f"Error in generator processing: {e}")
+        finally:
+            # Signal completion
+            await queue.put(None)
+    async def generate_speech_async_from_streaming_source(self, request: XTTSRequest) -> AsyncGenerator[XTTSOutput, None]:
+        """Generate speech for streaming source of text, making a streaming source of audio tokens and then decoding
+        and returning a streaming audio response."""
+        assert isinstance(request.text, AsyncGenerator), "Text must be an AsyncGenerator for streaming source."
+        # Prepare input with conditioning
+        gpt_cond_latent, speaker_embeddings = await self.get_conditioning_latents_async(
+            request.speaker_file,
+            request.max_ref_length,
+            request.gpt_cond_len,
+            request.gpt_cond_chunk_len
+        )
+        sampling_params = SamplingParams(
+            temperature=request.temperature,
+            top_p=request.top_p,
+            detokenize=False,
+            top_k=request.top_k,
+            logits_processors=[LogitsRepetitionPenalizer(request.repetition_penalty)],
+            repetition_penalty=1.0,  # Since we're handling repetition penalty manually
+            max_tokens=self.gpt_config.gpt_max_audio_tokens,
+            ignore_eos=True,  # Ignore the tokenizer eos token since it is for textual generation
+            stop_token_ids=[self.mel_eos_token_id],
+        )
+        accumulated_text = ""
+        async for text in request.text:
+            text = text.strip()
+            accumulated_text += text
+            if len(accumulated_text) > request.generate_every_n_chars:
+                tokens, embeddings = await self.prepare_text_tokens_async(accumulated_text, request.language)
+                # Concatenate along the sequence dimension, as in prepare_inputs_async
+                gpt_embed_input = [
+                    torch.cat([gpt_cond_latent, embeddings[0]], dim=1).squeeze(0)
+                    .to(self.llm_engine.engine.model_config.dtype)
+                ]
+                engine_inputs = TokensPrompt(prompt_token_ids=tokens)
+                engine_inputs["multi_modal_data"] = {"audio": {"embeds": gpt_embed_input[0], "is_logits_only_mode": False}}
+                token_generator = [self.llm_engine.generate(
+                    prompt=engine_inputs,
+                    sampling_params=sampling_params,
+                    request_id=request.request_id,
+                )]
+                # Process tokens to speech
+                async for output in self.process_tokens_to_speech(
+                        token_generator,
+                        speaker_embeddings,
+                        gpt_embed_input,
+                        chunk_size=50
+                ):
+                    yield output
+                accumulated_text = ""
+    async def generate_speech_from_text_async(self, request: XTTSRequest) -> AsyncGenerator[XTTSOutput, None]:
+        """Generate speech for a single request asynchronously."""
+        # Prepare input with conditioning
+        tokens_list, gpt_embed_inputs, speaker_embeddings = await self.prepare_inputs_async(
+            request.text,
+            request.language,
+            request.speaker_file,
+            request.max_ref_length,
+            request.gpt_cond_len,
+            request.gpt_cond_chunk_len,
+            split_text=True  # Split text to avoid OOM on big texts
+        )
+        # Start all requests in parallel
+        generators = []
+        for seq_index, sequence in enumerate(tokens_list):
+            sampling_params = SamplingParams(
+                temperature=request.temperature,
+                top_p=request.top_p,
+                detokenize=False,
+                top_k=request.top_k,
+                logits_processors=[LogitsRepetitionPenalizer(request.repetition_penalty)],
+                repetition_penalty=1.0,  # Since we're handling repetition penalty manually
+                max_tokens=self.gpt_config.gpt_max_audio_tokens,
+                ignore_eos=True,  # Ignore the tokenizer eos token since it is for textual generation
+                stop_token_ids=[self.mel_eos_token_id],
+            )
+            engine_inputs = TokensPrompt(prompt_token_ids=sequence)
+            if gpt_embed_inputs is not None:
+                engine_inputs["multi_modal_data"] = {"audio": {"embeds": gpt_embed_inputs[seq_index], "is_logits_only_mode": False}}
+            # Get audio token generator from VLLM
+            token_generator = self.llm_engine.generate(
+                prompt=engine_inputs,
+                sampling_params=sampling_params,
+                request_id=f"{request.request_id}_{seq_index}",
+            )
+            generators.append(token_generator)
+        # Process tokens to speech
+        async for output in self.process_tokens_to_speech(
+                generators,
+                speaker_embeddings,
+                gpt_embed_inputs,
+                chunk_size=50
+        ):
+            yield output
+    def generate_speech_from_text(self, request: XTTSRequest) -> List[XTTSOutput]:
+        """
+        Synchronous wrapper for generate_speech_from_text_async.
+        Args:
+            request: XTTSRequest object containing generation parameters
+        Returns:
+            List of XTTSOutput containing the generated speech segments
+        """
+        async def _collect_outputs():
+            outputs = []
+            async for output in self.generate_speech_from_text_async(request):
+                outputs.append(output)
+            return outputs
+        # Run the async code in an event loop
+        import concurrent.futures
+        # Get or create an event loop
+        try:
+            loop = asyncio.get_event_loop()
+        except RuntimeError:
+            loop = asyncio.new_event_loop()
+            asyncio.set_event_loop(loop)
+        if loop.is_running():
+            # run_until_complete() cannot be nested inside a loop that is already running in this
+            # thread, so drive a fresh loop from a worker thread and block until it finishes
+            with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+                results = pool.submit(asyncio.run, _collect_outputs()).result()
+        else:
+            results = loop.run_until_complete(_collect_outputs())
+        return results