|
--- |
|
license: apache-2.0 |
|
base_model: Qwen/Qwen2.5-7B |
|
library_name: peft |
|
language: |
|
- fr |
|
tags: |
|
- text-to-speech |
|
- lora |
|
- peft |
|
- ssml |
|
- qwen2.5 |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# π£οΈ French Text-to-Breaks LoRA Model |
|
|
|
**hi-paris/ssml-text2breaks-fr-lora** is a LoRA adapter for Qwen2.5-7B, fine-tuned to predict natural pause locations in French text by inserting symbolic `<break/>` markers.
|
|
|
This is the **first stage** of a two-step SSML cascade pipeline for improving French text-to-speech prosody control. |
|
|
|
> π **Paper**: *"Improving Synthetic Speech Quality via SSML Prosody Control"* |
|
> **Authors**: Nassima Ould-Ouali, Awais Sani, Ruben Bueno, Jonah Dauvet, Tim Luka Horstmann, Eric Moulines |
|
> **Conference**: ICNLSP 2025 |
|
> π **Demo & Audio Samples**: https://horstmann.tech/ssml-prosody-control/ |
|
|
|
## π§© Pipeline Overview |
|
|
|
| Stage | Model | Purpose | |
|
|-------|-------|---------| |
|
| 1οΈβ£ | **hi-paris/ssml-text2breaks-fr-lora** (this model) | Inserts `<break/>` markers at natural pause locations |
|
| 2οΈβ£ | [hi-paris/ssml-breaks2ssml-fr-lora](https://huggingface.co/hi-paris/ssml-breaks2ssml-fr-lora) | Converts breaks to full SSML with prosody | |
|
|
|
## β¨ Example |
|
|
|
**Input:** |
|
``` |
|
Bonjour comment allez-vous aujourd'hui ? |
|
``` |
|
|
|
**Output:** |
|
``` |
|
Bonjour comment allez-vous aujourd'hui ?<break/> |
|
``` |
|
|
|
## π Quick Start |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install torch transformers peft accelerate |
|
``` |
|
|
|
### Basic Usage |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
from peft import PeftModel |
|
import torch |
|
|
|
# Load base model and tokenizer |
|
base_model = AutoModelForCausalLM.from_pretrained( |
|
"Qwen/Qwen2.5-7B", |
|
torch_dtype=torch.float16, |
|
device_map="auto" |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B") |
|
|
|
# Load LoRA adapter |
|
model = PeftModel.from_pretrained(base_model, "hi-paris/ssml-text2breaks-fr-lora") |
|
|
|
# Prepare input |
|
text = "Bonjour comment allez-vous aujourd'hui ?" |
|
formatted_input = f"### Task:\nConvert text to SSML with pauses:\n\n### Text:\n{text}\n\n### SSML:\n" |
|
|
|
# Generate |
|
inputs = tokenizer(formatted_input, return_tensors="pt").to(model.device) |
|
with torch.no_grad(): |
|
outputs = model.generate( |
|
**inputs, |
|
max_new_tokens=256, |
|
temperature=0.7, |
|
do_sample=True, |
|
pad_token_id=tokenizer.eos_token_id |
|
) |
|
|
|
response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
result = response.split("### SSML:\n")[-1].strip() |
|
print(result) # "Bonjour comment allez-vous aujourd'hui ?<break/>" |
|
``` |
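
Break insertion is a largely deterministic tagging task, so sampling with `temperature=0.7` can produce run-to-run variation. A minimal greedy-decoding variant, reusing `model`, `tokenizer`, and `inputs` from the snippet above:

```python
# Greedy decoding: the same input text always yields the same break markers.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

result = tokenizer.decode(outputs[0], skip_special_tokens=True).split("### SSML:\n")[-1].strip()
```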
|
|
|
### Production Usage (Recommended) |
|
|
|
For production use with memory optimization and the full two-stage cascade, see our [inference repository](https://github.com/TimLukaHorstmann/cascading_model):
|
|
|
```python |
|
from text2breaks_inference import Text2BreaksInference |
|
|
|
# Memory-efficient shared model approach |
|
model = Text2BreaksInference() |
|
result = model.predict("Bonjour comment allez-vous aujourd'hui ?") |
|
``` |
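
If the inference repository is not an option, one way to reduce GPU memory is to load the base model in 4-bit with `bitsandbytes` before attaching the adapter. A sketch, assuming a CUDA GPU with `bitsandbytes` installed (quantization may slightly change the predicted breaks):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit NF4 quantization roughly quarters the base model's GPU footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = PeftModel.from_pretrained(base_model, "hi-paris/ssml-text2breaks-fr-lora")
```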
|
|
|
## π§ Full Cascade Example |
|
|
|
```python |
|
from breaks2ssml_inference import CascadedInference |
|
|
|
# Initialize full pipeline (memory efficient) |
|
cascade = CascadedInference() |
|
|
|
# Convert plain text directly to full SSML |
|
text = "Bonjour comment allez-vous aujourd'hui ?" |
|
ssml_output = cascade.predict(text) |
|
print(ssml_output) |
|
# Output: <prosody pitch="+2.5%" rate="-1.2%" volume="-5.0%">Bonjour comment allez-vous aujourd'hui ?</prosody><break time="300ms"/>
|
``` |
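
If you prefer to stay within `transformers` and `peft`, both adapters can share a single base model and be swapped per stage. A sketch of that approach; note that the stage-2 prompt template below is an assumption, so check the [stage-2 model card](https://huggingface.co/hi-paris/ssml-breaks2ssml-fr-lora) for the exact format:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# Both adapters attach to one base model, so only one 7B copy sits in memory.
model = PeftModel.from_pretrained(
    base, "hi-paris/ssml-text2breaks-fr-lora", adapter_name="text2breaks"
)
model.load_adapter("hi-paris/ssml-breaks2ssml-fr-lora", adapter_name="breaks2ssml")

def generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True).split("### SSML:\n")[-1].strip()

text = "Bonjour comment allez-vous aujourd'hui ?"

# Stage 1: insert symbolic <break/> markers.
model.set_adapter("text2breaks")
with_breaks = generate(f"### Task:\nConvert text to SSML with pauses:\n\n### Text:\n{text}\n\n### SSML:\n")

# Stage 2: expand markers into full SSML (prompt template assumed, not confirmed).
model.set_adapter("breaks2ssml")
ssml = generate(f"### Task:\nConvert breaks to SSML with prosody:\n\n### Text:\n{with_breaks}\n\n### SSML:\n")
print(ssml)
```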
|
|
|
## π§ Model Details |
|
|
|
- **Base Model**: [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) |
|
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) |
|
- **LoRA Rank**: 8, Alpha: 16 (see the configuration sketch below)
|
- **Target Modules**: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
|
- **Training**: 5 epochs, batch size 1 with gradient accumulation |
|
- **Language**: French |
|
- **Model Size**: 7B parameters (LoRA adapter: ~81 MB)
|
- **License**: Apache 2.0 |
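
For reference, a `LoraConfig` reconstructing the setup above; `lora_dropout` and `bias` are not stated on this card and are illustrative assumptions:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                    # LoRA rank, as listed above
    lora_alpha=16,          # scaling factor, as listed above
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,      # assumption: not stated on this card
    bias="none",            # assumption: peft default
    task_type="CAUSAL_LM",
)
```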
|
|
|
## π Performance |
|
|
|
The model predicts natural pause locations in French text with high accuracy and, combined with the second-stage model, improves prosody in synthesized speech. See the ICNLSP 2025 paper for the quantitative evaluation.
|
|
|
## π Resources |
|
|
|
- **Full Pipeline Code**: https://github.com/TimLukaHorstmann/cascading_model |
|
- **Interactive Demo**: [Colab Notebook](https://colab.research.google.com/drive/1bFcbJQY9OuY0_zlscqkf9PIgd3dUrIKs?usp=sharing) |
|
- **Stage 2 Model**: [hi-paris/ssml-breaks2ssml-fr-lora](https://huggingface.co/hi-paris/ssml-breaks2ssml-fr-lora) |
|
|
|
## π Citation |
|
|
|
```bibtex |
|
@inproceedings{ould-ouali2025_improving, |
|
title = {Improving Synthetic Speech Quality via SSML Prosody Control}, |
|
author = {Ould-Ouali, Nassima and Sani, Awais and Bueno, Ruben and Dauvet, Jonah and Horstmann, Tim Luka and Moulines, Eric}, |
|
booktitle = {Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP)}, |
|
year = {2025}, |
|
url = {https://huggingface.co/hi-paris} |
|
} |
|
``` |
|
|
|
## π License |
|
|
|
Apache 2.0 License (same as the base Qwen2.5-7B model) |
|
|