---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B
library_name: peft
language:
- fr
tags:
- lora
- peft
- ssml
- text-to-speech
- qwen2.5
pipeline_tag: text-generation
---
# French Breaks-to-SSML LoRA Model
**hi-paris/ssml-breaks2ssml-fr-lora** is a LoRA adapter fine-tuned from Qwen2.5-7B that converts French text containing symbolic `<break/>` markers into rich SSML markup with prosody control (pitch, rate, volume) and precise break timing.
This is the **second stage** of a two-step SSML cascade pipeline for improving French text-to-speech prosody control.
> **Paper**: *"Improving Synthetic Speech Quality via SSML Prosody Control"*
> **Authors**: Nassima Ould-Ouali, Awais Sani, Ruben Bueno, Jonah Dauvet, Tim Luka Horstmann, Eric Moulines
> **Conference**: ICNLSP 2025
> **Demo & Audio Samples**: https://horstmann.tech/ssml-prosody-control/
## Pipeline Overview
| Stage | Model | Purpose |
|-------|-------|---------|
| 1 | [hi-paris/ssml-text2breaks-fr-lora](https://huggingface.co/hi-paris/ssml-text2breaks-fr-lora) | Predicts natural pause locations |
| 2 | **hi-paris/ssml-breaks2ssml-fr-lora** (this model) | Converts breaks into full SSML with prosody |
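Both stages are LoRA adapters on the same Qwen2.5-7B base, so they can share a single copy of the base weights and be switched per stage. Below is a minimal sketch with `peft`; the adapter names are arbitrary, and the prompt templates for each stage follow the respective model cards and the inference repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# One copy of the 7B base weights, with both LoRA adapters attached to it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

model = PeftModel.from_pretrained(
    base, "hi-paris/ssml-text2breaks-fr-lora", adapter_name="text2breaks"
)
model.load_adapter("hi-paris/ssml-breaks2ssml-fr-lora", adapter_name="breaks2ssml")

model.set_adapter("text2breaks")   # stage 1: insert <break/> markers
# ... generate text with <break/> markers ...
model.set_adapter("breaks2ssml")   # stage 2: expand breaks into full SSML
# ... generate the final SSML ...
```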
## Example
**Input:**
```
Bonjour comment allez-vous ?<break/>
```
**Output:**
```
<prosody pitch="+2.5%" rate="-1.2%" volume="-5.0%">Bonjour comment allez-vous ?</prosody><break time="300ms"/>
```
## Quick Start
### Installation
```bash
pip install torch transformers peft accelerate
```
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "hi-paris/ssml-breaks2ssml-fr-lora")

# Prepare the input (text with <break/> markers)
text_with_breaks = "Bonjour comment allez-vous ?<break/>"
formatted_input = f"### Task:\nConvert text to SSML with pauses:\n\n### Text:\n{text_with_breaks}\n\n### SSML:\n"

# Generate (greedy decoding; temperature is unnecessary with do_sample=False)
inputs = tokenizer(formatted_input, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

# Keep only the generated SSML after the prompt
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
result = response.split("### SSML:\n")[-1].strip()
print(result)
```
### Production Usage (Recommended)
For production use with memory optimization, see our [inference repository](https://github.com/TimLukaHorstmann/cascading_model):
```python
from breaks2ssml_inference import Breaks2SSMLInference
# Memory-efficient shared model approach
model = Breaks2SSMLInference()
result = model.predict("Bonjour comment allez-vous ?<break/>")
```
## Full Cascade Example
```python
from breaks2ssml_inference import CascadedInference
# Initialize full pipeline (memory efficient - single base model)
cascade = CascadedInference()
# Convert plain text directly to full SSML
text = "Bonjour comment allez-vous aujourd'hui ?"
ssml_output = cascade.predict(text)
print(ssml_output)
# Output: '<prosody pitch="+2.5%" rate="-1.2%" volume="-5.0%">Bonjour comment allez-vous aujourd'hui ?</prosody><break time="300ms"/>'
```
## Model Details
- **Base Model**: [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **LoRA Rank**: 8, Alpha: 16
- **Target Modules**: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Training**: 5 epochs, batch size 1 with gradient accumulation
- **Language**: French
- **Model Size**: 7B parameters (LoRA adapter: ~81MB)
- **License**: Apache 2.0
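For reference, a PEFT configuration matching the hyperparameters above would look roughly like the sketch below; the dropout and bias values are assumptions, not taken from this card, and the exact training setup is in the pipeline repository.

```python
from peft import LoraConfig, TaskType

# Sketch of a LoraConfig matching the hyperparameters listed above.
# lora_dropout and bias are assumed defaults, not documented on this card.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
)
```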
## Performance
| Metric | Score |
|--------|-------|
| Pause Insertion Accuracy | 87.3% |
| RMSE (pause duration) | 98.5 ms |
| MOS gain (vs. baseline) | +0.42 |
*Evaluation performed on a held-out French validation set with annotated SSML pauses. Mean Opinion Score (MOS) gains were assessed on TTS outputs synthesized with the Azure `fr-FR-HenriNeural` voice and rated by 30 native French speakers.*
## SSML Features Generated
- **Prosody Control**: Dynamic pitch, rate, and volume adjustments
- **Break Timing**: Precise pause durations (e.g., `<break time="300ms"/>`)
- **Contextual Adaptation**: Prosody values adapted to semantic content
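The adapter emits only the inner `<prosody>`/`<break>` fragment, so it usually needs to be wrapped in a complete SSML document before being sent to a TTS engine. A minimal sketch with a hypothetical `wrap_ssml` helper, assuming Azure-style `<speak>`/`<voice>` framing (see the limitations below):

```python
# Hypothetical helper: wrap a generated fragment in a complete SSML document.
# The <speak>/<voice> framing follows typical Azure-style SSML and is not
# part of the model output itself.
def wrap_ssml(fragment: str, voice: str = "fr-FR-HenriNeural") -> str:
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="fr-FR">'
        f'<voice name="{voice}">{fragment}</voice>'
        "</speak>"
    )

print(wrap_ssml(
    '<prosody pitch="+2.5%" rate="-1.2%" volume="-5.0%">'
    'Bonjour comment allez-vous ?</prosody><break time="300ms"/>'
))
```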
## Limitations
- Optimized primarily for Azure TTS voices (e.g., `fr-FR-HenriNeural`)
- Requires input text that already contains `<break/>` markers (use the Stage 1 model to predict them automatically)
- Generates only `<break/>` tags and `<prosody>` wrappers (pitch/rate/volume); other SSML elements are not produced
## Resources
- **Full Pipeline Code**: https://github.com/TimLukaHorstmann/cascading_model
- **Interactive Demo**: [Colab Notebook](https://colab.research.google.com/drive/1bFcbJQY9OuY0_zlscqkf9PIgd3dUrIKs?usp=sharing)
- **Stage 1 Model**: [hi-paris/ssml-text2breaks-fr-lora](https://huggingface.co/hi-paris/ssml-text2breaks-fr-lora)
## Citation
```bibtex
@inproceedings{ould-ouali2025_improving,
title = {Improving Synthetic Speech Quality via SSML Prosody Control},
author = {Ould-Ouali, Nassima and Sani, Awais and Bueno, Ruben and Dauvet, Jonah and Horstmann, Tim Luka and Moulines, Eric},
booktitle = {Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP)},
year = {2025},
url = {https://huggingface.co/hi-paris}
}
```
## License
Apache 2.0 License (same as the base Qwen2.5-7B model)