TimLukaHorstmann committed on
Commit
486463d
·
1 Parent(s): f8457de

Updated model + inference + model card + colab

README.md CHANGED
@@ -1,153 +1,172 @@
1
  ---
2
-
3
-
4
  license: apache-2.0
5
  base_model: Qwen/Qwen2.5-7B
6
  library_name: peft
 
 
7
  tags:
8
- - text-to-speech
 
9
  - ssml
10
- - french
11
  - qwen2.5
12
- - lora
13
-
14
-
15
- ---
16
-
17
- # 🗣️ ssml-break2ssml-fr-lora
18
-
19
-
20
- This is the second-stage LoRA adapter for **French SSML generation**, converting *pause-annotated text* into full SSML markup with `<break>` tags.
21
-
22
- This model is part of the cascade described in the paper:
23
-
24
- **"Improving French Synthetic Speech Quality via SSML Prosody Control"**
25
- Nassima Ould-Ouali, Éric Moulines – *ICNLSP 2025 (Springer LNCS)* [accepted].
26
-
27
-
28
  ---
29
 
 
30
 
31
- ## 🧠 Model Details
32
 
33
- - **Base model**: [`Qwen/Qwen2.5-7B`](https://huggingface.co/Qwen/Qwen2.5-7B)
34
- - **Adapter method**: LoRA (Low-Rank Adaptation via [`peft`](https://github.com/huggingface/peft))
35
- - **LoRA rank**: 8 — **Alpha**: 16
36
- - **Training**: 5 epochs, batch size 1 (gradient accumulation)
37
- - **Languages**: French
38
- - **Model size**: 7B (adapter-only)
39
- - **License**: Apache 2.0
40
 
41
- ---
 
 
 
42
 
43
  ## 🧩 Pipeline Overview
44
 
45
- This model is part of a two-stage SSML cascade for improving French TTS prosody:
46
-
47
- | Step | Model | Description |
48
- |------|-------------------------------------------|----------------------------------------------|
49
- | 1️⃣ | `nassimaODL/ssml-text2breaks-fr-lora` | Inserts symbolic pauses like `#250`, `#500` |
50
- | 2️⃣ | `nassimaODL/ssml-break2ssml-fr-lora` | Converts symbols to `<break time="..."/>` SSML |
51
-
52
 
53
  ## ✨ Example
54
 
55
- ```text
56
- Input: Bonjour#250 comment vas-tu ?
57
- Output: Bonjour<break time="250ms"/> comment vas-tu ?
 
58
 
 
 
 
59
  ```
60
 
61
- ---
62
 
 
63
 
64
- ## 🚀 How to run the code
 
 
65
 
66
- ```python
67
 
 
68
  from transformers import AutoTokenizer, AutoModelForCausalLM
69
  from peft import PeftModel
70
-
 
 
 
 
 
 
 
71
  tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
72
- base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", device_map="auto")
73
- model = PeftModel.from_pretrained(base_model, "nassimaODL/ssml-break2ssml-fr-lora")
74
 
75
- input_text = "Bonjour#250 comment vas-tu ?"
76
- inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
77
 
78
- with torch.no_grad():
79
- outputs = model.generate(**inputs, max_new_tokens=128)
80
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
81
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
82
  ```
83
 
84
- ---
85
-
86
-
87
- ## 🧪 Evaluation Summary
88
 
 
89
 
90
- | Metric | Value |
91
- |--------------------------|---------------|
92
- | Pause Insertion Accuracy | 87.3% |
93
- | RMSE (pause duration) | 98.5 ms |
94
- | MOS gain (vs. baseline) | +0.42 |
95
 
96
- Evaluation was performed on a held-out French validation set with annotated SSML pauses. Mean Opinion Score (MOS) improvements were assessed using TTS outputs rendered with Azure Henri voice and rated by 30 native French speakers.
 
 
 
97
 
 
98
 
99
- ---
 
100
 
 
 
101
 
102
- ## 📚 Training Data
 
 
 
 
 
103
 
 
104
 
105
- This LoRA adapter was trained on a corpus of ~4,500 French utterances. Input texts were annotated with symbolic pause indicators (e.g., `#250` for 250ms), automatically aligned using a combination of Whisper-Kyutai timestamping and F0/syntactic heuristics.
 
 
 
 
 
 
 
106
 
107
- Annotations were refined via a hybrid heuristic rule set combining:
108
- - Voice activity boundaries (via Auditok)
109
- - F0 contour analysis (pitch dips before breaks)
110
- - Syntactic cues (punctuation, conjunctions)
111
 
112
- For full details, see our data preparation pipeline on GitHub:
113
- 🔗 [https://github.com/NassimaOULDOUALI/Prosody-Control-French-TTS](https://github.com/NassimaOULDOUALI/Prosody-Control-French-TTS)
 
 
 
114
 
115
- ---
116
 
117
- ## ⚙️ Training Setup
118
 
119
- - **Compute**: Jean-Zay (GENCI/IDRIS), A100 80GB x1
120
- - **Framework**: HuggingFace `transformers` + `peft`
121
- - **LoRA method**: rank = 8, alpha = 16, dropout = 0.05
122
- - **Precision**: bf16
123
- - **Max sequence length**: 768 tokens (256 input + 512 output)
124
- - **Epochs**: 5
125
- - **Optimizer**: AdamW (lr = 2e-4, no warmup)
126
- - **LoRA target modules**:
127
- `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
128
 
129
- Training was performed using the [Unsloth](https://github.com/unslothai/unsloth) SFTTrainer and PEFT adapter injection on Qwen2.5-7B base.
130
 
131
- ---
 
 
132
 
133
- ## ⚠️ Limitations
134
 
135
- - Only `<break>` tags are supported; no pitch, rate, or emphasis control yet.
136
- - Pause accuracy is sensitive to punctuation and malformed inputs.
137
- - SSML output has been optimized primarily for Azure voices (e.g., `fr-FR-HenriNeural`). Other engines may interpret `<break>` tags differently.
138
- - The model assumes the presence of symbolic pause markers in the input (e.g., `#250`). For automatic prediction of such symbols, refer to our stage-1 model:
139
- 🔗 [`nassimaODL/ssml-text2breaks-fr-lora`](https://huggingface.co/nassimaODL/ssml-text2breaks-fr-lora)
140
-
141
- ---
142
 
143
  ## 📖 Citation
 
 
144
  @inproceedings{ould-ouali2025_improving,
145
  title = {Improving Synthetic Speech Quality via SSML Prosody Control},
146
  author = {Ould-Ouali, Nassima and Sani, Awais and Bueno, Ruben and Dauvet, Jonah and Horstmann, Tim Luka and Moulines, Eric},
147
- booktitle = {Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP)}, % TODO: verify the exact title used by the conference
148
  year = {2025},
149
- pages = {XX--YY}, % TODO
150
- publisher = {—}, % TODO
151
- address = {—} % TODO
152
  }
 
 
 
153
 
 
 
1
  ---
 
 
2
  license: apache-2.0
3
  base_model: Qwen/Qwen2.5-7B
4
  library_name: peft
5
+ language:
6
+ - fr
7
  tags:
8
+ - lora
9
+ - peft
10
  - ssml
11
+ - text-to-speech
12
  - qwen2.5
13
+ pipeline_tag: text-generation
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
14
  ---
15
 
16
+ # 🗣️ French Breaks-to-SSML LoRA Model
17
 
18
+ **hi-paris/ssml-breaks2ssml-fr-lora** is a LoRA adapter fine-tuned on Qwen2.5-7B to convert text with symbolic `<break/>` markers into rich SSML markup with prosody control (pitch, rate, volume) and precise break timing.
19
 
20
+ This is the **second stage** of a two-step SSML cascade pipeline for improving French text-to-speech prosody control.
 
 
 
 
 
 
21
 
22
+ > 📄 **Paper**: *"Improving Synthetic Speech Quality via SSML Prosody Control"*
23
+ > **Authors**: Nassima Ould-Ouali, Awais Sani, Ruben Bueno, Jonah Dauvet, Tim Luka Horstmann, Eric Moulines
24
+ > **Conference**: ICNLSP 2025
25
+ > 🔗 **Demo & Audio Samples**: https://horstmann.tech/ssml-prosody-control/
26
 
27
  ## 🧩 Pipeline Overview
28
 
29
+ | Stage | Model | Purpose |
30
+ |-------|-------|---------|
31
+ | 1️⃣ | [hi-paris/ssml-text2breaks-fr-lora](https://huggingface.co/hi-paris/ssml-text2breaks-fr-lora) | Predicts natural pause locations |
32
+ | 2️⃣ | **hi-paris/ssml-breaks2ssml-fr-lora** | Converts breaks to full SSML with prosody |
 
 
 
33
 
34
  ## ✨ Example
35
 
36
+ **Input:**
37
+ ```
38
+ Bonjour comment allez-vous ?<break/>
39
+ ```
40
 
41
+ **Output:**
42
+ ```
43
+ <prosody pitch="+2.5%" rate="-1.2%" volume="-5.0%">Bonjour comment allez-vous ?</prosody><break time="300ms"/>
44
  ```
45
 
46
+ ## 🚀 Quick Start
47
 
48
+ ### Installation
49
 
50
+ ```bash
51
+ pip install torch transformers peft accelerate
52
+ ```
53
 
54
+ ### Basic Usage
55
 
56
+ ```python
57
  from transformers import AutoTokenizer, AutoModelForCausalLM
58
  from peft import PeftModel
59
+ import torch
60
+
61
+ # Load base model and tokenizer
62
+ base_model = AutoModelForCausalLM.from_pretrained(
63
+ "Qwen/Qwen2.5-7B",
64
+ torch_dtype=torch.float16,
65
+ device_map="auto"
66
+ )
67
  tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
 
 
68
 
69
+ # Load LoRA adapter
70
+ model = PeftModel.from_pretrained(base_model, "hi-paris/ssml-breaks2ssml-fr-lora")
71
 
72
+ # Prepare input (text with <break/> markers)
73
+ text_with_breaks = "Bonjour comment allez-vous ?<break/>"
74
+ formatted_input = f"### Task:\nConvert text to SSML with pauses:\n\n### Text:\n{text_with_breaks}\n\n### SSML:\n"
75
 
76
+ # Generate
77
+ inputs = tokenizer(formatted_input, return_tensors="pt").to(model.device)
78
+ with torch.no_grad():
79
+ outputs = model.generate(
80
+ **inputs,
81
+ max_new_tokens=128,
82
+ # no temperature here: it is ignored under greedy decoding (do_sample=False)
83
+ do_sample=False,
84
+ pad_token_id=tokenizer.eos_token_id
85
+ )
86
+
87
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
88
+ result = response.split("### SSML:\n")[-1].strip()
89
+ print(result)
90
  ```
91
 
92
+ ### Production Usage (Recommended)
 
 
 
93
 
94
+ For production use with memory optimization, see our [inference repository](https://github.com/TimLukaHorstmann/cascading_model):
95
 
96
+ ```python
97
+ from breaks2ssml_inference import Breaks2SSMLInference
 
 
 
98
 
99
+ # Memory-efficient shared model approach
100
+ model = Breaks2SSMLInference()
101
+ result = model.predict("Bonjour comment allez-vous ?<break/>")
102
+ ```
103
 
104
+ ## 🔧 Full Cascade Example
105
 
106
+ ```python
107
+ from breaks2ssml_inference import CascadedInference
108
 
109
+ # Initialize full pipeline (memory efficient - single base model)
110
+ cascade = CascadedInference()
111
 
112
+ # Convert plain text directly to full SSML
113
+ text = "Bonjour comment allez-vous aujourd'hui ?"
114
+ ssml_output = cascade.predict(text)
115
+ print(ssml_output)
116
+ # Example output (illustrative; actual prosody values and break durations vary): <prosody pitch="+2.5%" rate="-1.2%" volume="-5.0%">Bonjour comment allez-vous aujourd'hui ?</prosody><break time="300ms"/>
117
+ ```
118
 
119
+ ## 🧠 Model Details
120
 
121
+ - **Base Model**: [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
122
+ - **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
123
+ - **LoRA Rank**: 8, Alpha: 16
124
+ - **Target Modules**: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
125
+ - **Training**: 5 epochs, batch size 1 with gradient accumulation
126
+ - **Language**: French
127
+ - **Model Size**: 7B parameters (LoRA adapter: ~81MB)
128
+ - **License**: Apache 2.0
129
 
130
+ ## 📊 Performance
 
 
 
131
 
132
+ | Metric | Score |
133
+ |--------|-------|
134
+ | Pause Insertion Accuracy | 87.3% |
135
+ | RMSE (pause duration) | 98.5 ms |
136
+ | MOS gain (vs. baseline) | +0.42 |
137
 
138
+ *Evaluation was performed on a held-out French validation set with annotated SSML pauses. Mean Opinion Score (MOS) improvements were assessed using TTS outputs rendered with the Azure Henri voice and rated by 30 native French speakers.*
139
 
140
+ ## 🎯 SSML Features Generated
141
 
142
+ - **Prosody Control**: Dynamic pitch, rate, and volume adjustments
143
+ - **Break Timing**: Precise pause durations (e.g., `<break time="300ms"/>`)
144
+ - **Contextual Adaptation**: Prosody values adapted to semantic content
 
 
 
 
 
 
145
 
146
+ ## ⚠️ Limitations
147
 
148
+ - Optimized primarily for Azure TTS voices (e.g., `fr-FR-HenriNeural`)
149
+ - Requires input text that already contains `<break/>` markers (use the Stage 1 model to predict them automatically)
150
+ - Generates only `<break>` tags and a `<prosody>` wrapper (pitch, rate, volume)
151
 
152
+ ## 🔗 Resources
153
 
154
+ - **Full Pipeline Code**: https://github.com/TimLukaHorstmann/cascading_model
155
+ - **Interactive Demo**: [Colab Notebook](https://colab.research.google.com/drive/1bFcbJQY9OuY0_zlscqkf9PIgd3dUrIKs?usp=sharing)
156
+ - **Stage 1 Model**: [hi-paris/ssml-text2breaks-fr-lora](https://huggingface.co/hi-paris/ssml-text2breaks-fr-lora)
 
 
 
 
157
 
158
  ## 📖 Citation
159
+
160
+ ```bibtex
161
  @inproceedings{ould-ouali2025_improving,
162
  title = {Improving Synthetic Speech Quality via SSML Prosody Control},
163
  author = {Ould-Ouali, Nassima and Sani, Awais and Bueno, Ruben and Dauvet, Jonah and Horstmann, Tim Luka and Moulines, Eric},
164
+ booktitle = {Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP)},
165
  year = {2025},
166
+ url = {https://huggingface.co/hi-paris}
 
 
167
  }
168
+ ```
169
+
170
+ ## 📜 License
171
 
172
+ Apache 2.0 License (same as the base Qwen2.5-7B model)
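The Basic Usage snippet in the README diff above wraps the input in a `### Task / ### Text / ### SSML` prompt and then recovers the answer by splitting on the `### SSML:` marker. The sketch below isolates those two steps so they can be checked without loading the model. The helper names `build_prompt` and `extract_ssml` are hypothetical and not part of the repository, and the decoded string is simulated rather than produced by the adapter.

```python
# Prompt layout copied from the README's Basic Usage snippet.
PROMPT_TEMPLATE = (
    "### Task:\nConvert text to SSML with pauses:\n\n"
    "### Text:\n{text}\n\n"
    "### SSML:\n"
)
SSML_MARKER = "### SSML:\n"


def build_prompt(text_with_breaks: str) -> str:
    """Hypothetical helper: wrap break-annotated text in the prompt layout."""
    return PROMPT_TEMPLATE.format(text=text_with_breaks)


def extract_ssml(decoded: str) -> str:
    """Hypothetical helper: keep only the text after the last SSML marker."""
    return decoded.split(SSML_MARKER)[-1].strip()


prompt = build_prompt("Bonjour comment allez-vous ?<break/>")

# Simulated decoder output: a causal LM echoes the prompt before the completion.
decoded = prompt + (
    '<prosody pitch="+2.5%" rate="-1.2%" volume="-5.0%">'
    "Bonjour comment allez-vous ?</prosody><break time=\"300ms\"/>\n"
)
print(extract_ssml(decoded))
```

Splitting on the last marker is needed because `tokenizer.decode(outputs[0])` returns the prompt together with the generated tokens. Slicing the output tensor at the prompt length is a common alternative.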
added_tokens.json DELETED
@@ -1,24 +0,0 @@
1
- {
2
- "</tool_call>": 151658,
3
- "<tool_call>": 151657,
4
- "<|box_end|>": 151649,
5
- "<|box_start|>": 151648,
6
- "<|endoftext|>": 151643,
7
- "<|file_sep|>": 151664,
8
- "<|fim_middle|>": 151660,
9
- "<|fim_pad|>": 151662,
10
- "<|fim_prefix|>": 151659,
11
- "<|fim_suffix|>": 151661,
12
- "<|im_end|>": 151645,
13
- "<|im_start|>": 151644,
14
- "<|image_pad|>": 151655,
15
- "<|object_ref_end|>": 151647,
16
- "<|object_ref_start|>": 151646,
17
- "<|quad_end|>": 151651,
18
- "<|quad_start|>": 151650,
19
- "<|repo_name|>": 151663,
20
- "<|video_pad|>": 151656,
21
- "<|vision_end|>": 151653,
22
- "<|vision_pad|>": 151654,
23
- "<|vision_start|>": 151652
24
- }
chat_template.jinja DELETED
@@ -1,54 +0,0 @@
1
- {%- if tools %}
2
- {{- '<|im_start|>system\n' }}
3
- {%- if messages[0]['role'] == 'system' %}
4
- {{- messages[0]['content'] }}
5
- {%- else %}
6
- {{- 'You are a helpful assistant.' }}
7
- {%- endif %}
8
- {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
9
- {%- for tool in tools %}
10
- {{- "\n" }}
11
- {{- tool | tojson }}
12
- {%- endfor %}
13
- {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
14
- {%- else %}
15
- {%- if messages[0]['role'] == 'system' %}
16
- {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
17
- {%- else %}
18
- {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
19
- {%- endif %}
20
- {%- endif %}
21
- {%- for message in messages %}
22
- {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
23
- {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
24
- {%- elif message.role == "assistant" %}
25
- {{- '<|im_start|>' + message.role }}
26
- {%- if message.content %}
27
- {{- '\n' + message.content }}
28
- {%- endif %}
29
- {%- for tool_call in message.tool_calls %}
30
- {%- if tool_call.function is defined %}
31
- {%- set tool_call = tool_call.function %}
32
- {%- endif %}
33
- {{- '\n<tool_call>\n{"name": "' }}
34
- {{- tool_call.name }}
35
- {{- '", "arguments": ' }}
36
- {{- tool_call.arguments | tojson }}
37
- {{- '}\n</tool_call>' }}
38
- {%- endfor %}
39
- {{- '<|im_end|>\n' }}
40
- {%- elif message.role == "tool" %}
41
- {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
42
- {{- '<|im_start|>user' }}
43
- {%- endif %}
44
- {{- '\n<tool_response>\n' }}
45
- {{- message.content }}
46
- {{- '\n</tool_response>' }}
47
- {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
48
- {{- '<|im_end|>\n' }}
49
- {%- endif %}
50
- {%- endif %}
51
- {%- endfor %}
52
- {%- if add_generation_prompt %}
53
- {{- '<|im_start|>assistant\n' }}
54
- {%- endif %}
merges.txt DELETED
The diff for this file is too large to render. See raw diff
 
notebook.ipynb ADDED
@@ -0,0 +1,378 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# French SSML Cascade Models Demo\n",
8
+ "\n",
9
+ "<img src=\"https://www.hi-paris.fr/wp-content/uploads/2020/09/logo-hi-paris-retina.png\" alt=\"Hi! Paris\" width=\"200\"/>\n",
10
+ "\n",
11
+ "**Interactive demonstration of French SSML cascade models for improved text-to-speech prosody control.**\n",
12
+ "\n",
13
+ "This notebook demonstrates the complete pipeline from plain French text to rich SSML markup with prosody control.\n",
14
+ "\n",
15
+ "## 🧩 Pipeline Overview\n",
16
+ "\n",
17
+ "1. **Text-to-Breaks**: Predicts natural pause locations \n",
18
+ "2. **Breaks-to-SSML**: Adds prosody control (pitch, rate, volume) and precise timing\n",
19
+ "\n",
20
+ "📄 **Paper**: *Improving Synthetic Speech Quality via SSML Prosody Control* (ICNLSP 2025) \n",
21
+ "🔗 **Demo & Audio Samples**: https://horstmann.tech/ssml-prosody-control/ \n",
22
+ "📚 **Models**: [hi-paris/ssml-text2breaks-fr-lora](https://huggingface.co/hi-paris/ssml-text2breaks-fr-lora) • [hi-paris/ssml-breaks2ssml-fr-lora](https://huggingface.co/hi-paris/ssml-breaks2ssml-fr-lora)\n",
23
+ "\n",
24
+ "---"
25
+ ]
26
+ },
27
+ {
28
+ "cell_type": "markdown",
29
+ "metadata": {},
30
+ "source": [
31
+ "## 🚀 Setup\n",
32
+ "\n",
33
+ "### Step 1: Mount Google Drive"
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "code",
38
+ "execution_count": 34,
39
+ "metadata": {
40
+ "colab": {
41
+ "base_uri": "https://localhost:8080/"
42
+ },
43
+ "id": "a1jNj9uK7EoL",
44
+ "outputId": "76624289-061f-4700-e397-50da9da9ee6d"
45
+ },
46
+ "outputs": [
47
+ {
48
+ "name": "stdout",
49
+ "output_type": "stream",
50
+ "text": [
51
+ "Mounted at /content/drive\n"
52
+ ]
53
+ }
54
+ ],
55
+ "source": [
56
+ "from google.colab import drive\n",
57
+ "drive.mount('/content/drive', force_remount=True)"
58
+ ]
59
+ },
60
+ {
61
+ "cell_type": "markdown",
62
+ "metadata": {},
63
+ "source": [
64
+ "### Step 2: Clone Repository"
65
+ ]
66
+ },
67
+ {
68
+ "cell_type": "code",
69
+ "execution_count": 35,
70
+ "metadata": {
71
+ "colab": {
72
+ "base_uri": "https://localhost:8080/"
73
+ },
74
+ "id": "eE3iUaX_7OLG",
75
+ "outputId": "d621b296-b12f-489a-bc1f-c7240c21646b"
76
+ },
77
+ "outputs": [
78
+ {
79
+ "name": "stderr",
80
+ "output_type": "stream",
81
+ "text": [
82
+ "shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory\n",
83
+ "chdir: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory\n",
84
+ "Cloning into 'cascading_model'...\n"
85
+ ]
86
+ }
87
+ ],
88
+ "source": [
89
+ "%%bash\n",
90
+ "cd /content/drive/MyDrive/\n",
91
+ "git clone https://github.com/TimLukaHorstmann/cascading_model.git"
92
+ ]
93
+ },
94
+ {
95
+ "cell_type": "code",
96
+ "execution_count": 36,
97
+ "metadata": {
98
+ "colab": {
99
+ "base_uri": "https://localhost:8080/"
100
+ },
101
+ "id": "vItNbMvh7ZNL",
102
+ "outputId": "31a31144-1261-4427-9d2e-089ae17689b2"
103
+ },
104
+ "outputs": [
105
+ {
106
+ "name": "stdout",
107
+ "output_type": "stream",
108
+ "text": [
109
+ "/content/drive/MyDrive/cascading_model\n"
110
+ ]
111
+ }
112
+ ],
113
+ "source": [
114
+ "%cd /content/drive/MyDrive/cascading_model/\n"
115
+ ]
116
+ },
117
+ {
118
+ "cell_type": "code",
119
+ "execution_count": 37,
120
+ "metadata": {
121
+ "colab": {
122
+ "base_uri": "https://localhost:8080/"
123
+ },
124
+ "id": "JdeuCOX_7kae",
125
+ "outputId": "f8bad5e1-92d0-4531-fbe0-ca2f29a8efd8"
126
+ },
127
+ "outputs": [
128
+ {
129
+ "name": "stdout",
130
+ "output_type": "stream",
131
+ "text": [
132
+ "breaks2ssml_inference.py\n",
133
+ "demo.py\n",
134
+ "empty_ssml_creation.py\n",
135
+ "__init__.py\n",
136
+ "pyproject.toml\n",
137
+ "README.md\n",
138
+ "requirements.txt\n",
139
+ "shared_models.py\n",
140
+ "test_models.py\n",
141
+ "text2breaks_inference.py\n"
142
+ ]
143
+ }
144
+ ],
145
+ "source": [
146
+ "%%bash\n",
147
+ "ls"
148
+ ]
149
+ },
150
+ {
151
+ "cell_type": "markdown",
152
+ "metadata": {},
153
+ "source": [
154
+ "## 🧪 Testing & Demo\n",
155
+ "\n",
156
+ "### Step 3: Verify Installation"
157
+ ]
158
+ },
159
+ {
160
+ "cell_type": "code",
161
+ "execution_count": 38,
162
+ "metadata": {
163
+ "colab": {
164
+ "base_uri": "https://localhost:8080/"
165
+ },
166
+ "id": "eaBx_eh-819B",
167
+ "outputId": "2c55f4fa-f17e-49b8-b032-74d670dcd34a"
168
+ },
169
+ "outputs": [
170
+ {
171
+ "name": "stdout",
172
+ "output_type": "stream",
173
+ "text": [
174
+ "2025-08-06 12:36:48.453347: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
175
+ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
176
+ "E0000 00:00:1754483808.475278 35366 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
177
+ "E0000 00:00:1754483808.481612 35366 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
178
+ "============================================================\n",
179
+ "🧪 French SSML Models - Test Suite\n",
180
+ "============================================================\n",
181
+ "🔍 Testing imports...\n",
182
+ " ✅ PyTorch 2.5.1+cu121\n",
183
+ " ✅ Transformers 4.54.0\n",
184
+ " ✅ PEFT 0.16.0\n",
185
+ " ✅ All imports successful!\n",
186
+ "\n",
187
+ "🔧 Testing model loading...\n",
188
+ " Loading text2breaks model...\n",
189
+ "Loading checkpoint shards: 100% 4/4 [01:33<00:00, 23.46s/it]\n",
190
+ " ✅ Text2breaks model loaded\n",
191
+ " Loading breaks2ssml model...\n",
192
+ " ✅ Breaks2ssml model loaded\n",
193
+ " ✅ All models loaded successfully!\n",
194
+ "\n",
195
+ "🧪 Testing inference...\n",
196
+ " Input: Bonjour comment allez-vous ?\n",
197
+ " Testing text2breaks...\n",
198
+ "The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
199
+ " Step 1 result: Bonjour comment allez-vous ?<break/>\n",
200
+ " Testing breaks2ssml...\n",
201
+ "The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
202
+ " Step 2 result: <prosody pitch=\"+0.64%\" rate=\"-1.92%\" volume=\"-10.00%\">\n",
203
+ " Bonjour comment allez-vous ?\n",
204
+ " </prosody>\n",
205
+ " <break time=\"500ms\"/>\n",
206
+ " ✅ Inference test successful!\n",
207
+ "\n",
208
+ "🔗 Testing full cascade...\n",
209
+ " Input: Bonsoir comment ça va ?\n",
210
+ " Cascade result: <prosody pitch=\"+0.64%\" rate=\"-1.92%\" volume=\"-10.00%\">\n",
211
+ " Bonsoir comment ça va ?\n",
212
+ " </prosody>\n",
213
+ " <break time=\"500ms\"/>\n",
214
+ " ✅ Cascade test successful!\n",
215
+ "\n",
216
+ "============================================================\n",
217
+ "🎉 All tests passed! The models are working correctly.\n",
218
+ "============================================================\n",
219
+ "\n",
220
+ "You can now use:\n",
221
+ " - python demo.py (for examples)\n",
222
+ " - python demo.py --interactive (for interactive mode)\n",
223
+ " - python text2breaks_inference.py --interactive\n",
224
+ " - python breaks2ssml_inference.py --interactive\n"
225
+ ]
226
+ }
227
+ ],
228
+ "source": [
229
+ "!python test_models.py"
230
+ ]
231
+ },
232
+ {
233
+ "cell_type": "markdown",
234
+ "metadata": {},
235
+ "source": [
236
+ "### Step 4: Interactive Demo\n",
237
+ "\n",
238
+ "Run the interactive demo to test the models with your own French text:"
239
+ ]
240
+ },
241
+ {
242
+ "cell_type": "code",
243
+ "execution_count": 29,
244
+ "metadata": {
245
+ "colab": {
246
+ "base_uri": "https://localhost:8080/"
247
+ },
248
+ "id": "ZIeUY9atUhvV",
249
+ "outputId": "581f1395-fa70-424f-9c66-50b5e44547c3"
250
+ },
251
+ "outputs": [
252
+ {
253
+ "name": "stdout",
254
+ "output_type": "stream",
255
+ "text": [
256
+ "2025-08-06 12:21:35.541051: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
257
+ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
258
+ "E0000 00:00:1754482895.561958 31169 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
259
+ "E0000 00:00:1754482895.568312 31169 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
260
+ "================================================================================\n",
261
+ "Interactive French SSML Cascade\n",
262
+ "================================================================================\n",
263
+ "\n",
264
+ "Choose mode:\n",
265
+ "1. Full cascade (text → breaks → SSML)\n",
266
+ "2. Text to breaks only\n",
267
+ "3. Breaks to SSML only\n",
268
+ "\n",
269
+ "Select mode (1-3): 1\n",
270
+ "\n",
271
+ "Initializing models...\n",
272
+ "Loading checkpoint shards: 100% 4/4 [01:30<00:00, 22.70s/it]\n",
273
+ "Models loaded successfully!\n",
274
+ "\n",
275
+ "Enter French text (empty line to exit):\n",
276
+ "\n",
277
+ "> Je suis Luka.\n",
278
+ "The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
279
+ "The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
280
+ "Output: <prosody pitch=\"+0.64%\" rate=\"-1.92%\" volume=\"-10.00%\">\n",
281
+ " Je suis Luka.\n",
282
+ " </prosody>\n",
283
+ " <break time=\"500ms\"/>\n",
284
+ "Time: 6.55s\n",
285
+ "\n",
286
+ "> Trés bien.\n",
287
+ "The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
288
+ "The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
289
+ "Output: <prosody pitch=\"+0.64%\" rate=\"-1.92%\" volume=\"-10.00%\">\n",
290
+ " Trés bien.\n",
291
+ " </prosody>\n",
292
+ " <break time=\"500ms\"/>\n",
293
+ "Time: 5.64s\n",
294
+ "\n",
295
+ "> Je suis Bertrand Perier. Je suis avocat et vous écoutez ma masterclass.\n",
296
+ "The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
297
+ "The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
298
+ "Output: <prosody pitch=\"+0.64%\" rate=\"-1.92%\" volume=\"-10.00%\">\n",
299
+ " Je suis Bertrand Perier.\n",
300
+ " </prosody>\n",
301
+ " <break time=\"500ms\"/>\n",
302
+ "\n",
303
+ " <prosody pitch=\"+3.78%\" rate=\"-1.29%\" volume=\"-10.00%\">\n",
304
+ " Je suis avocat et vous écoutez ma masterclass.\n",
305
+ " </prosody>\n",
306
+ " <break time=\"500ms\"/>\n",
307
+ "Time: 12.11s\n",
308
+ "\n",
309
+ "> Exception ignored in: <module 'threading' from '/usr/lib/python3.11/threading.py'>\n",
310
+ "Traceback (most recent call last):\n",
311
+ " File \"/usr/lib/python3.11/threading.py\", line 1541, in _shutdown\n",
312
+ " def _shutdown():\n",
313
+ " \n",
314
+ "KeyboardInterrupt: \n"
315
+ ]
316
+ }
317
+ ],
318
+ "source": [
319
+ "!python demo.py --interactive"
320
+ ]
321
+ },
322
+ {
323
+ "cell_type": "markdown",
324
+ "metadata": {},
325
+ "source": [
326
+ "## 🎯 Example Usage\n",
327
+ "\n",
328
+ "```python\n",
329
+ "from breaks2ssml_inference import CascadedInference\n",
330
+ "\n",
331
+ "# Initialize the full cascade\n",
332
+ "cascade = CascadedInference()\n",
333
+ "\n",
334
+ "# Convert plain French text to SSML\n",
335
+ "text = \"Bonjour comment allez-vous aujourd'hui ?\"\n",
336
+ "result = cascade.predict(text)\n",
337
+ "print(result)\n",
338
+ "```\n",
339
+ "\n",
340
+ "**Example output (illustrative; actual values vary):**\n",
341
+ "```xml\n",
342
+ "<prosody pitch=\"+2.5%\" rate=\"-1.2%\" volume=\"-5.0%\">Bonjour comment allez-vous aujourd'hui ?</prosody><break time=\"300ms\"/>\n",
343
+ "```\n",
344
+ "\n",
345
+ "## 📚 Resources\n",
346
+ "\n",
347
+ "- **Audio Demos**: https://horstmann.tech/ssml-prosody-control/\n",
348
+ "- **GitHub Repository**: https://github.com/TimLukaHorstmann/cascading_model\n",
349
+ "- **Stage 1 Model**: https://huggingface.co/hi-paris/ssml-text2breaks-fr-lora\n",
350
+ "- **Stage 2 Model**: https://huggingface.co/hi-paris/ssml-breaks2ssml-fr-lora\n",
351
+ "\n",
352
+ "---\n",
353
+ "*Hi! Paris - Interdisciplinary Research Institute for Artificial Intelligence*"
354
+ ]
355
+ },
356
+ {
357
+ "cell_type": "markdown",
358
+ "metadata": {},
359
+ "source": []
360
+ }
361
+ ],
362
+ "metadata": {
363
+ "accelerator": "GPU",
364
+ "colab": {
365
+ "gpuType": "T4",
366
+ "provenance": []
367
+ },
368
+ "kernelspec": {
369
+ "display_name": "Python 3",
370
+ "name": "python3"
371
+ },
372
+ "language_info": {
373
+ "name": "python"
374
+ }
375
+ },
376
+ "nbformat": 4,
377
+ "nbformat_minor": 0
378
+ }
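The cascade output recorded in the notebook above is an SSML fragment (`<prosody>` blocks followed by `<break>` tags) with no root element. A minimal sketch of how such a fragment might be wrapped and sanity-checked before it is sent to a TTS engine follows. The helper names are hypothetical. The `<speak>` attributes follow the W3C SSML convention and are an assumption here, since neither the README nor the notebook shows this step, and a given engine may require more (for example a `<voice>` element).

```python
import re
import xml.etree.ElementTree as ET


def wrap_in_speak(fragment: str, lang: str = "fr-FR") -> str:
    """Hypothetical helper: add an SSML root element around a model fragment."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{fragment}</speak>'
    )


def is_well_formed(ssml: str) -> bool:
    """Return True if the document parses as XML (no schema validation)."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


def break_durations_ms(ssml: str) -> list[int]:
    """Collect the millisecond values of all <break time="...ms"/> tags."""
    return [int(ms) for ms in re.findall(r'<break time="(\d+)ms"\s*/>', ssml)]


# Fragment copied from the test-suite output logged in the notebook.
fragment = (
    '<prosody pitch="+0.64%" rate="-1.92%" volume="-10.00%">\n'
    " Bonjour comment allez-vous ?\n"
    " </prosody>\n"
    ' <break time="500ms"/>'
)

document = wrap_in_speak(fragment)
print(is_well_formed(document))
print(break_durations_ms(document))
```

The check only confirms that tags are balanced. It does not verify that attribute values such as `pitch="+0.64%"` are accepted by a particular voice.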
special_tokens_map.json DELETED
@@ -1,31 +0,0 @@
1
- {
2
- "additional_special_tokens": [
3
- "<|im_start|>",
4
- "<|im_end|>",
5
- "<|object_ref_start|>",
6
- "<|object_ref_end|>",
7
- "<|box_start|>",
8
- "<|box_end|>",
9
- "<|quad_start|>",
10
- "<|quad_end|>",
11
- "<|vision_start|>",
12
- "<|vision_end|>",
13
- "<|vision_pad|>",
14
- "<|image_pad|>",
15
- "<|video_pad|>"
16
- ],
17
- "eos_token": {
18
- "content": "<|endoftext|>",
19
- "lstrip": false,
20
- "normalized": false,
21
- "rstrip": false,
22
- "single_word": false
23
- },
24
- "pad_token": {
25
- "content": "<|endoftext|>",
26
- "lstrip": false,
27
- "normalized": false,
28
- "rstrip": false,
29
- "single_word": false
30
- }
31
- }
tokenizer.json DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
3
- size 11421896
tokenizer_config.json DELETED
@@ -1,207 +0,0 @@
- {
- "add_bos_token": false,
- "add_prefix_space": false,
- "added_tokens_decoder": {
- "151643": {
- "content": "<|endoftext|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151644": {
- "content": "<|im_start|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151645": {
- "content": "<|im_end|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151646": {
- "content": "<|object_ref_start|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151647": {
- "content": "<|object_ref_end|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151648": {
- "content": "<|box_start|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151649": {
- "content": "<|box_end|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151650": {
- "content": "<|quad_start|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151651": {
- "content": "<|quad_end|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151652": {
- "content": "<|vision_start|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151653": {
- "content": "<|vision_end|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151654": {
- "content": "<|vision_pad|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151655": {
- "content": "<|image_pad|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151656": {
- "content": "<|video_pad|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": true
- },
- "151657": {
- "content": "<tool_call>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": false
- },
- "151658": {
- "content": "</tool_call>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": false
- },
- "151659": {
- "content": "<|fim_prefix|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": false
- },
- "151660": {
- "content": "<|fim_middle|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": false
- },
- "151661": {
- "content": "<|fim_suffix|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": false
- },
- "151662": {
- "content": "<|fim_pad|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": false
- },
- "151663": {
- "content": "<|repo_name|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": false
- },
- "151664": {
- "content": "<|file_sep|>",
- "lstrip": false,
- "normalized": false,
- "rstrip": false,
- "single_word": false,
- "special": false
- }
- },
- "additional_special_tokens": [
- "<|im_start|>",
- "<|im_end|>",
- "<|object_ref_start|>",
- "<|object_ref_end|>",
- "<|box_start|>",
- "<|box_end|>",
- "<|quad_start|>",
- "<|quad_end|>",
- "<|vision_start|>",
- "<|vision_end|>",
- "<|vision_pad|>",
- "<|image_pad|>",
- "<|video_pad|>"
- ],
- "bos_token": null,
- "clean_up_tokenization_spaces": false,
- "eos_token": "<|endoftext|>",
- "errors": "replace",
- "extra_special_tokens": {},
- "model_max_length": 131072,
- "pad_token": "<|endoftext|>",
- "split_special_tokens": false,
- "tokenizer_class": "Qwen2Tokenizer",
- "unk_token": null
- }
vocab.json DELETED
The diff for this file is too large to render. See raw diff