---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B
library_name: peft
tags:
- text-to-speech
- ssml
- french
- qwen2.5
- lora
---

# ssml-break2ssml-fr-lora

This is the second-stage LoRA adapter for **French SSML generation**. It converts *pause-annotated text* into full SSML markup with `<break>` tags.

This model is part of the cascade described in the paper:
**"Improving French Synthetic Speech Quality via SSML Prosody Control"**
Nassima Ould-Ouali, Éric Moulines – *ICNLSP 2025 (Springer LNCS)* [accepted].

---

## 🧠 Model Details

- **Base model**: [`Qwen/Qwen2.5-7B`](https://huggingface.co/Qwen/Qwen2.5-7B)
- **Adapter method**: LoRA (Low-Rank Adaptation via [`peft`](https://github.com/huggingface/peft))
- **LoRA rank**: 8, **alpha**: 16
- **Training**: 5 epochs, batch size 1 (with gradient accumulation)
- **Language**: French
- **Model size**: 7B base model; this repository contains only the adapter weights
- **License**: Apache 2.0

---

## 🧩 Pipeline Overview

This model is part of a two-stage SSML cascade for improving French TTS prosody:

| Step | Model | Description |
|------|-------------------------------------------|----------------------------------------------|
| 1️⃣ | `nassimaODL/ssml-text2breaks-fr-lora` | Inserts symbolic pauses like `#250`, `#500` |
| 2️⃣ | `nassimaODL/ssml-break2ssml-fr-lora` | Converts pause symbols to `<break time="..."/>` SSML tags |

### ✨ Example

```text
Input:  Bonjour#250 comment vas-tu ?
Output: Bonjour <break time="250ms"/> comment vas-tu ?
```

---

## 🚀 Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model and attach the stage-2 LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", device_map="auto")
model = PeftModel.from_pretrained(base_model, "nassimaODL/ssml-break2ssml-fr-lora")

# Convert pause-annotated text into SSML
input_text = "Bonjour#250 comment vas-tu ?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
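The two stages can also be chained end to end. Below is a minimal sketch, assuming both adapters can be attached to a single base model and switched with `peft`'s `set_adapter`; the adapter names (`text2breaks`, `break2ssml`) and the `run` helper are illustrative, and the exact prompt format each stage expects may differ from the raw text shown here:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", device_map="auto")

# Attach both stage adapters to one base model.
model = PeftModel.from_pretrained(
    base, "nassimaODL/ssml-text2breaks-fr-lora", adapter_name="text2breaks"
)
model.load_adapter("nassimaODL/ssml-break2ssml-fr-lora", adapter_name="break2ssml")

def run(text: str, adapter: str) -> str:
    """Generate with the selected LoRA adapter and return only the new tokens."""
    model.set_adapter(adapter)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

text = "Bonjour comment vas-tu ?"
with_breaks = run(text, "text2breaks")   # e.g. "Bonjour#250 comment vas-tu ?"
ssml = run(with_breaks, "break2ssml")    # e.g. 'Bonjour <break time="250ms"/> ...'
print(ssml)
```

Keeping both adapters on one base model avoids loading the 7B weights twice.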
---

## 🧪 Evaluation Summary

| Metric                   | Value   |
|--------------------------|---------|
| Pause insertion accuracy | 87.3%   |
| RMSE (pause duration)    | 98.5 ms |
| MOS gain (vs. baseline)  | +0.42   |

Evaluation was performed on a held-out French validation set with annotated SSML pauses. Mean Opinion Score (MOS) improvements were assessed on TTS outputs rendered with the Azure `fr-FR-HenriNeural` voice and rated by 30 native French speakers.

---

## 📚 Training Data

This LoRA adapter was trained on a corpus of ~4,500 French utterances. Input texts were annotated with symbolic pause indicators (e.g., `#250` for a 250 ms pause), automatically aligned using a combination of Whisper-Kyutai timestamping and F0/syntactic heuristics.

Annotations were refined via a hybrid heuristic rule set combining:

- Voice activity boundaries (via Auditok)
- F0 contour analysis (pitch dips before breaks)
- Syntactic cues (punctuation, conjunctions)

For full details, see our data preparation pipeline on GitHub:
🔗 [https://github.com/NassimaOULDOUALI/Prosody-Control-French-TTS](https://github.com/NassimaOULDOUALI/Prosody-Control-French-TTS)

---

## ⚙️ Training Setup

- **Compute**: Jean-Zay (GENCI/IDRIS), 1× A100 80 GB
- **Framework**: Hugging Face `transformers` + `peft`
- **LoRA config**: rank = 8, alpha = 16, dropout = 0.05
- **Precision**: bf16
- **Max sequence length**: 768 tokens (256 input + 512 output)
- **Epochs**: 5
- **Optimizer**: AdamW (lr = 2e-4, no warmup)
- **LoRA target modules**: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

Training was performed with the [Unsloth](https://github.com/unslothai/unsloth) SFTTrainer and PEFT adapter injection on the Qwen2.5-7B base model. A configuration sketch is given at the end of this card.

---

## ⚠️ Limitations

- Only `<break>` tags are supported; no pitch, rate, or emphasis control yet.
- Pause accuracy is sensitive to punctuation and malformed inputs.
- SSML output has been optimized primarily for Azure voices (e.g., `fr-FR-HenriNeural`); other engines may interpret `<break>` tags differently.
- The model assumes the presence of symbolic pause markers in the input (e.g., `#250`). For automatic prediction of such markers, refer to our stage-1 model:
  🔗 [`nassimaODL/ssml-text2breaks-fr-lora`](https://huggingface.co/nassimaODL/ssml-text2breaks-fr-lora)

---

## 📖 Citation

```bibtex
@inproceedings{ould-ouali2025improving,
  author    = {Nassima Ould-Ouali and Awais Sani and Tim Luka Horstmann and Jonah Dauvet and Ruben Bueno and Éric Moulines},
  title     = {Improving French Synthetic Speech Quality via SSML Prosody Control},
  booktitle = {Proceedings of the 9th International Conference on Natural Language and Speech Processing (ICNLSP)},
  series    = {Lecture Notes in Computer Science},
  publisher = {Springer},
  year      = {2025},
  note      = {To appear}
}
```
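---

For reproducibility, the LoRA hyperparameters listed under *Training Setup* correspond roughly to the following `peft` configuration. This is a minimal sketch, not the actual training script (which used Unsloth's SFTTrainer):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# LoRA hyperparameters as reported under "Training Setup".
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # adapter params only, a small fraction of 7B
```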