---
language: ar
tags:
- text-to-speech
- tts
- arabic
- styletts2
- pl-bert
license: mit
hardware: H100
---

# Model Card for Arabic StyleTTS2

This is an Arabic text-to-speech model based on the StyleTTS2 architecture, adapted specifically for Arabic speech synthesis. The model produces good-quality Arabic speech, though not yet state-of-the-art; further experimentation is needed to optimize performance for Arabic specifically. All training objectives from the original StyleTTS2 were maintained, except the WavLM objectives, which were removed because they were designed primarily for English speech.

## Example

Here is an example output from the model:

#### Sample 1

## Efficiency and Performance

A key strength of this model lies in its efficiency and performance characteristics:

- **Compact Architecture**: Achieves impressive quality with <100M parameters
- **Limited Training Data**: Trained on only 22 hours of single-speaker audio
- **Transfer Learning**: Successfully fine-tuned from a LibriTTS multi-speaker model to single-speaker Arabic
- **Resource Efficient**: Good quality achieved despite limited computational resources

Note: According to the StyleTTS2 authors, performance should improve further when training a single-speaker model from scratch rather than fine-tuning. This was not attempted here due to computational resource constraints, suggesting potential for even better results with more extensive training.

## Model Details

### Model Description

This model is a modified version of StyleTTS2, adapted for Arabic text-to-speech synthesis. It incorporates a custom-trained PL-BERT model for Arabic language understanding and removes the WavLM adversarial training component (which was designed primarily for English).
- **Developed by:** Fadi (GitHub: Fadi987)
- **Model type:** Text-to-Speech (StyleTTS2 architecture)
- **Language(s):** Arabic
- **Finetuned from model:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)

### Model Sources

- **Repository:** [Fadi987/StyleTTS2](https://github.com/Fadi987/StyleTTS2)
- **Paper:** [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
- **PL-BERT Model:** [fadi77/pl-bert](https://huggingface.co/fadi77/pl-bert)

## Uses

### Direct Use

The model can be used to generate Arabic speech from text. To use the model:

1. Clone the StyleTTS2 repository:

```bash
git clone https://github.com/Fadi987/StyleTTS2
cd StyleTTS2
```

2. Install `espeak-ng` as the phonemization backend:

```bash
# For macOS
brew install espeak-ng

# For Ubuntu/Debian
sudo apt-get install espeak-ng

# For Windows: download and install espeak-ng from
# https://github.com/espeak-ng/espeak-ng/releases
```

3. Install the Python dependencies:

```bash
pip install -r requirements.txt
```

4. Download the `model.pth` and `config.yml` files from this repository
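The model expects properly diacritized Arabic input (see the note after the inference command below), so a quick heuristic check can catch undiacritized text before synthesis. The helper below is an illustrative sketch, not part of the StyleTTS2 repository, and the 0.5 threshold is an assumption:

```python
# Heuristic check that Arabic text carries enough diacritics for synthesis.
# Illustrative helper only; not part of the StyleTTS2 repository.

ARABIC_LETTERS_START, ARABIC_LETTERS_END = "\u0621", "\u064A"  # hamza .. yeh
# Tanwin forms, short vowels, shadda, sukun
ARABIC_DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def diacritic_ratio(text: str) -> float:
    """Ratio of diacritic marks to base Arabic letters in `text`."""
    letters = sum(ARABIC_LETTERS_START <= ch <= ARABIC_LETTERS_END for ch in text)
    marks = sum(ch in ARABIC_DIACRITICS for ch in text)
    return marks / max(letters, 1)

def looks_diacritized(text: str, threshold: float = 0.5) -> bool:
    """True when most letters appear to carry a diacritic (threshold is an assumption)."""
    return diacritic_ratio(text) >= threshold

if __name__ == "__main__":
    # Fully diacritized example sentence from this card
    sample = "الإِتْقَانُ يَحْتَاجُ إِلَى الْعَمَلِ وَالْمُثَابَرَة"
    print(looks_diacritized(sample))
```

A check like this can be run on input text before invoking `inference.py`, warning the user instead of silently producing degraded speech.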
5. Run inference using:

```bash
python inference.py --config config.yml --model model.pth --text "الإِتْقَانُ يَحْتَاجُ إِلَى الْعَمَلِ وَالْمُثَابَرَة"
```

Make sure to use properly diacritized Arabic text for best results.

### Out-of-Scope Use

The model is designed specifically for Arabic text-to-speech synthesis and may not perform well on:

- Other languages
- Heavy dialect variations
- Non-diacritized Arabic text

## Training Details

### Training Data

- Training was performed on approximately 22 hours of Arabic audiobook data
- Dataset: [fadi77/arabic-audiobook-dataset-24khz](https://huggingface.co/datasets/fadi77/arabic-audiobook-dataset-24khz)
- The PL-BERT component was trained on fully diacritized Arabic Wikipedia text

### Training Hyperparameters

- **Number of epochs:** 20
- **Diffusion training:** Started from epoch 5

### Objectives

- **Training objectives:** All original StyleTTS2 objectives maintained, except WavLM adversarial training
- **Validation objectives:** Identical to the original StyleTTS2 validation process

### Compute Infrastructure

- **Hardware Type:** NVIDIA H100 GPU

### Notable Modifications from Original StyleTTS2 in Architecture and Objectives

The architecture of the model follows that of StyleTTS2 with the following exceptions:

- Removed the WavLM adversarial training component
- Custom PL-BERT trained for Arabic

## Citation

**BibTeX:**

```bibtex
@article{styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S. and Mischler, Gavin and Mesgarani, Nima},
  journal={arXiv preprint arXiv:2306.07691},
  year={2023}
}
```

## Model Card Contact

GitHub: [@Fadi987](https://github.com/Fadi987)
Hugging Face: [@fadi77](https://huggingface.co/fadi77)