---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B
library_name: peft
tags:
- text-to-speech
- ssml
- french
- qwen2.5
- lora
---

# ssml-break2ssml-fr-lora

This is the second-stage LoRA adapter for **French SSML generation**. It converts *pause-annotated text* into full SSML markup with `<break>` tags.

This model is part of the cascade described in the paper:
**"Improving French Synthetic Speech Quality via SSML Prosody Control"**
Nassima Ould-Ouali, Éric Moulines – *ICNLSP 2025 (Springer LNCS)* [accepted].

---

## 🧠 Model Details

- **Base model**: [`Qwen/Qwen2.5-7B`](https://huggingface.co/Qwen/Qwen2.5-7B)
- **Adapter method**: LoRA (Low-Rank Adaptation via [`peft`](https://github.com/huggingface/peft))
- **LoRA rank**: 8, **alpha**: 16
- **Training**: 5 epochs, batch size 1 (with gradient accumulation)
- **Language**: French
- **Model size**: 7B base model; this repository contains only the adapter weights
- **License**: Apache 2.0

---

## 🧩 Pipeline Overview

This model is part of a two-stage SSML cascade for improving French TTS prosody:

| Step | Model | Description |
|------|-------------------------------------------|----------------------------------------------|
| 1️⃣ | `nassimaODL/ssml-text2breaks-fr-lora` | Inserts symbolic pauses like `#250`, `#500` |
| 2️⃣ | `nassimaODL/ssml-break2ssml-fr-lora` | Converts pause symbols to `<break time="..."/>` SSML tags |

### ✨ Example

```text
Input:  Bonjour#250 comment vas-tu ?
Output: Bonjour <break time="250ms"/> comment vas-tu ?
```

---

## 🚀 Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model and attach the stage-2 LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", device_map="auto")
model = PeftModel.from_pretrained(base_model, "nassimaODL/ssml-break2ssml-fr-lora")

# Convert pause-annotated text into SSML
input_text = "Bonjour#250 comment vas-tu ?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
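The two stages can also be chained end to end. Below is a minimal sketch, assuming both adapters can be attached to a single base model and switched with `peft`'s `set_adapter`; the adapter names (`text2breaks`, `break2ssml`) and the `run` helper are illustrative, and the exact prompt format each stage expects may differ from the raw text shown here:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", device_map="auto")

# Attach both stage adapters to one base model.
model = PeftModel.from_pretrained(
    base, "nassimaODL/ssml-text2breaks-fr-lora", adapter_name="text2breaks"
)
model.load_adapter("nassimaODL/ssml-break2ssml-fr-lora", adapter_name="break2ssml")

def run(text: str, adapter: str) -> str:
    """Generate with the selected LoRA adapter and return only the new tokens."""
    model.set_adapter(adapter)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

text = "Bonjour comment vas-tu ?"
with_breaks = run(text, "text2breaks")   # e.g. "Bonjour#250 comment vas-tu ?"
ssml = run(with_breaks, "break2ssml")    # e.g. 'Bonjour <break time="250ms"/> ...'
print(ssml)
```

Keeping both adapters on one base model avoids loading the 7B weights twice.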
---

## 🧪 Evaluation Summary

| Metric                   | Value   |
|--------------------------|---------|
| Pause insertion accuracy | 87.3%   |
| RMSE (pause duration)    | 98.5 ms |
| MOS gain (vs. baseline)  | +0.42   |

Evaluation was performed on a held-out French validation set with annotated SSML pauses. Mean Opinion Score (MOS) improvements were assessed on TTS outputs rendered with the Azure `fr-FR-HenriNeural` voice and rated by 30 native French speakers.

---

## 📚 Training Data

This LoRA adapter was trained on a corpus of ~4,500 French utterances. Input texts were annotated with symbolic pause indicators (e.g., `#250` for a 250 ms pause), automatically aligned using a combination of Whisper-Kyutai timestamping and F0/syntactic heuristics.

Annotations were refined via a hybrid heuristic rule set combining:

- Voice activity boundaries (via Auditok)
- F0 contour analysis (pitch dips before breaks)
- Syntactic cues (punctuation, conjunctions)

For full details, see our data preparation pipeline on GitHub:
🔗 [https://github.com/NassimaOULDOUALI/Prosody-Control-French-TTS](https://github.com/NassimaOULDOUALI/Prosody-Control-French-TTS)

---

## ⚙️ Training Setup

- **Compute**: Jean-Zay (GENCI/IDRIS), 1× A100 80 GB
- **Framework**: Hugging Face `transformers` + `peft`
- **LoRA config**: rank = 8, alpha = 16, dropout = 0.05
- **Precision**: bf16
- **Max sequence length**: 768 tokens (256 input + 512 output)
- **Epochs**: 5
- **Optimizer**: AdamW (lr = 2e-4, no warmup)
- **LoRA target modules**: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

Training was performed with the [Unsloth](https://github.com/unslothai/unsloth) SFTTrainer and PEFT adapter injection on the Qwen2.5-7B base model. A configuration sketch is given at the end of this card.

---

## ⚠️ Limitations

- Only `<break>` tags are supported; no pitch, rate, or emphasis control yet.
- Pause accuracy is sensitive to punctuation and malformed inputs.
- SSML output has been optimized primarily for Azure voices (e.g., `fr-FR-HenriNeural`); other engines may interpret `<break>` tags differently.
- The model assumes the presence of symbolic pause markers in the input (e.g., `#250`). For automatic prediction of such markers, refer to our stage-1 model:
  🔗 [`nassimaODL/ssml-text2breaks-fr-lora`](https://huggingface.co/nassimaODL/ssml-text2breaks-fr-lora)

---

## 📖 Citation

```bibtex
@inproceedings{ould-ouali2025improving,
  author    = {Nassima Ould-Ouali and Awais Sani and Tim Luka Horstmann and Jonah Dauvet and Ruben Bueno and Éric Moulines},
  title     = {Improving French Synthetic Speech Quality via SSML Prosody Control},
  booktitle = {Proceedings of the 9th International Conference on Natural Language and Speech Processing (ICNLSP)},
  series    = {Lecture Notes in Computer Science},
  publisher = {Springer},
  year      = {2025},
  note      = {To appear}
}
```
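---

For reproducibility, the LoRA hyperparameters listed under *Training Setup* correspond roughly to the following `peft` configuration. This is a minimal sketch, not the actual training script (which used Unsloth's SFTTrainer):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# LoRA hyperparameters as reported under "Training Setup".
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # adapter params only, a small fraction of 7B
```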