Joseph Pollack committed on commit a3a3978 · unverified · 1 Parent(s): 6434b46
docs/README.md ADDED

# Voxtral ASR Fine-tuning Documentation

```mermaid
graph TD
    %% Main Entry Point
    START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}

    %% Documentation Categories
    OVERVIEW --> ARCH[🏗️ Architecture Overview]
    OVERVIEW --> WORKFLOW[🔄 Interface Workflow]
    OVERVIEW --> TRAINING[🚀 Training Pipeline]
    OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
    OVERVIEW --> DATAFLOW[📊 Data Flow]

    %% Architecture Section
    ARCH --> ARCH_DIAG["High-level Architecture<br/>System Components & Layers"]
    ARCH --> ARCH_LINK["📄 View Details → architecture.md"]
    click ARCH_LINK "architecture.md"

    %% Interface Section
    WORKFLOW --> WORKFLOW_DIAG["User Journey<br/>Recording → Training → Demo"]
    WORKFLOW --> WORKFLOW_LINK["📄 View Details → interface-workflow.md"]
    click WORKFLOW_LINK "interface-workflow.md"

    %% Training Section
    TRAINING --> TRAINING_DIAG["Training Scripts<br/>Data → Model → Results"]
    TRAINING --> TRAINING_LINK["📄 View Details → training-pipeline.md"]
    click TRAINING_LINK "training-pipeline.md"

    %% Deployment Section
    DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo<br/>Model → Hub → Space"]
    DEPLOYMENT --> DEPLOYMENT_LINK["📄 View Details → deployment-pipeline.md"]
    click DEPLOYMENT_LINK "deployment-pipeline.md"

    %% Data Flow Section
    DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey<br/>Input → Processing → Output"]
    DATAFLOW --> DATAFLOW_LINK["📄 View Details → data-flow.md"]
    click DATAFLOW_LINK "data-flow.md"

    %% Key Components Highlight
    subgraph "🎛️ Core Components"
        INTERFACE[interface.py<br/>Gradio Web UI]
        TRAIN_SCRIPTS[scripts/train*.py<br/>Training Scripts]
        DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
        PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
    end

    %% Data Flow Highlight
    subgraph "📁 Key Data Formats"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        HFDATA[HF Hub Models<br/>username/model-name]
        SPACES[HF Spaces<br/>Interactive Demos]
    end

    %% Connect components to their respective docs
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT

    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    %% Styling
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
```

## Voxtral ASR Fine-tuning Application

This documentation provides comprehensive diagrams and explanations of the Voxtral ASR Fine-tuning application architecture and workflows.

### 🎯 What is Voxtral ASR Fine-tuning?

Voxtral is a powerful Automatic Speech Recognition (ASR) model that can be fine-tuned for specific tasks and languages. This application provides:

- **🎙️ Easy Data Collection**: Record audio or upload files with transcripts
- **🚀 One-Click Training**: Fine-tune Voxtral with LoRA or full parameter updates
- **🌐 Instant Deployment**: Deploy interactive demos to Hugging Face Spaces
- **📊 Experiment Tracking**: Monitor training progress with Trackio integration

### 📚 Documentation Overview

#### 🏗️ [Architecture Overview](architecture.md)
High-level view of system components and their relationships:
- **User Interface Layer**: Gradio web interface
- **Data Processing Layer**: Audio processing and dataset creation
- **Training Layer**: Full and LoRA fine-tuning scripts
- **Model Management Layer**: HF Hub integration and model cards
- **Deployment Layer**: Demo space deployment

#### 🔄 [Interface Workflow](interface-workflow.md)
Complete user journey through the application:
- **Language Selection**: Choose from 25+ languages via NVIDIA Granary
- **Data Collection**: Record audio or upload existing files
- **Dataset Creation**: Process audio + transcripts into JSONL format
- **Training Configuration**: Set hyperparameters and options
- **Live Training**: Real-time progress monitoring
- **Auto Deployment**: One-click model publishing and demo creation

#### 🚀 [Training Pipeline](training-pipeline.md)
Detailed training process and script interactions:
- **Data Sources**: JSONL datasets, HF Hub datasets, NVIDIA Granary
- **Data Processing**: Audio resampling, text tokenization, data collation
- **Training Scripts**: `train.py` (full) vs `train_lora.py` (parameter-efficient)
- **Infrastructure**: Trackio logging, Hugging Face Trainer, device management
- **Model Outputs**: Trained models, training logs, checkpoints

#### 🌐 [Deployment Pipeline](deployment-pipeline.md)
Model publishing and demo deployment process:
- **Model Publishing**: Push to Hugging Face Hub with metadata
- **Model Card Generation**: Automated documentation creation
- **Demo Space Deployment**: Create interactive demos on HF Spaces
- **Configuration Management**: Environment variables and secrets
- **Live Demo Features**: Real-time ASR inference interface

#### 📊 [Data Flow](data-flow.md)
Complete data journey through the system:
- **Input Sources**: Microphone recordings, file uploads, external datasets
- **Processing Pipeline**: Audio resampling, text cleaning, JSONL conversion
- **Training Flow**: Dataset loading, batching, model training
- **Output Pipeline**: Model files, logs, checkpoints, published assets
- **External Integration**: HF Hub, NVIDIA Granary, Trackio Spaces

### 🛠️ Core Components

| Component | Purpose | Key Features |
|-----------|---------|--------------|
| `interface.py` | Main web application | Gradio UI, data collection, training orchestration |
| `scripts/train.py` | Full model fine-tuning | Complete parameter updates, maximum accuracy |
| `scripts/train_lora.py` | LoRA fine-tuning | Parameter-efficient, faster training, lower memory |
| `scripts/deploy_demo_space.py` | Demo deployment | Automated HF Spaces creation and configuration |
| `scripts/push_to_huggingface.py` | Model publishing | HF Hub integration, model card generation |
| `scripts/generate_model_card.py` | Documentation | Automated model card creation from templates |

### 📁 Key Data Formats

#### JSONL Dataset Format
```json
{"audio_path": "path/to/audio.wav", "text": "transcription text"}
```
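
Each line is one standalone JSON object. As a hedged illustration (the repo also ships its own `_load_jsonl_dataset()` helper), such a file can be loaded with the 🤗 `datasets` JSON loader; the path below is the example path used throughout these docs:

```python
from datasets import load_dataset, Audio

# Load the JSONL file produced by the interface (one JSON object per line)
ds = load_dataset("json", data_files="datasets/voxtral_user/data.jsonl", split="train")
# Decode the referenced audio files at the 16 kHz rate the model expects
ds = ds.cast_column("audio_path", Audio(sampling_rate=16000))
print(ds[0]["text"])
```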

#### Training Configuration
```json
{
  "model_checkpoint": "mistralai/Voxtral-Mini-3B-2507",
  "batch_size": 2,
  "learning_rate": 5e-5,
  "epochs": 3,
  "lora_r": 8,
  "lora_alpha": 32
}
```
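
How these keys map onto the Hugging Face `Trainer` is sketched below; the left-hand argument names are the standard `transformers` ones, while the exact wiring inside `scripts/train*.py` may differ:

```python
from transformers import TrainingArguments

config = {"batch_size": 2, "learning_rate": 5e-5, "epochs": 3}

# Plausible mapping of the JSON config onto TrainingArguments
args = TrainingArguments(
    output_dir="outputs/voxtral-finetune",
    per_device_train_batch_size=config["batch_size"],
    learning_rate=config["learning_rate"],
    num_train_epochs=config["epochs"],
)
```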

#### Model Repository Structure
```
username/model-name/
├── model.safetensors
├── config.json
├── tokenizer.json
├── README.md (model card)
└── training_results/
```

### 🚀 Quick Start

1. **Set Environment Variables**:
   ```bash
   export HF_TOKEN=your_huggingface_token
   export HF_USERNAME=your_username
   ```

2. **Launch Interface**:
   ```bash
   python interface.py
   ```

3. **Follow the Workflow**:
   - Select language → Record/upload data → Configure training → Start training
   - Monitor progress → View results → Deploy demo

### 📋 Prerequisites

- **Hardware**: NVIDIA GPU recommended for training
- **Software**: Python 3.8+, CUDA-compatible GPU drivers
- **Tokens**: Hugging Face token for model access and publishing
- **Storage**: Sufficient disk space for models and datasets

### 🔧 Configuration Options

#### Training Modes
- **LoRA Fine-tuning**: Efficient, fast, lower memory usage (see the sketch below)
- **Full Fine-tuning**: Maximum accuracy, higher memory requirements
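
A minimal LoRA setup with `peft`, using the `lora_r`/`lora_alpha` values from the configuration example above (the target modules here are an assumption; `scripts/train_lora.py` defines the actual ones):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # "lora_r" in the config
    lora_alpha=32,                        # "lora_alpha" in the config
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # model: the loaded Voxtral checkpoint
model.print_trainable_parameters()          # small fraction of total parameters
```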

#### Data Sources
- **User Recordings**: Live microphone input
- **File Uploads**: Existing WAV/FLAC files
- **NVIDIA Granary**: High-quality multilingual datasets
- **HF Hub Datasets**: Community-contributed datasets

#### Deployment Options
- **HF Hub Publishing**: Share models publicly
- **Demo Spaces**: Interactive web demos
- **Model Cards**: Automated documentation

### 📈 Performance & Metrics

#### Training Metrics
- **Loss Curves**: Training and validation loss
- **Perplexity**: Model confidence measure
- **Word Error Rate**: ASR accuracy, if available (see the sketch below)
- **Training Time**: Time to convergence
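
Word Error Rate can be computed with the `jiwer` package; this is a generic illustration rather than the metric code used by the training scripts:

```python
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown box"
# WER = (substitutions + insertions + deletions) / reference word count
print(jiwer.wer(reference, hypothesis))  # 0.25: one substituted word out of four
```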

#### Resource Usage
- **GPU Memory**: Peak memory usage during training
- **Training Time**: Hours/days depending on dataset size
- **Model Size**: Disk space requirements

### 🤝 Contributing

The documentation is organized as interlinked Markdown files with Mermaid diagrams. Each diagram focuses on a specific aspect:

- **architecture.md**: System overview and component relationships
- **interface-workflow.md**: User experience and interaction flow
- **training-pipeline.md**: Technical training process details
- **deployment-pipeline.md**: Publishing and deployment mechanics
- **data-flow.md**: Data movement and transformation

### 📄 Additional Resources

- **Hugging Face Spaces**: [Live Demo](https://huggingface.co/spaces)
- **Voxtral Models**: [Model Hub](https://huggingface.co/mistralai)
- **NVIDIA Granary**: [Dataset Documentation](https://huggingface.co/nvidia/Granary)
- **Trackio**: [Experiment Tracking](https://trackio.space)

---

*This documentation was automatically generated to explain the Voxtral ASR Fine-tuning application architecture and workflows.*

docs/architecture.md ADDED

# Voxtral ASR Fine-tuning Architecture

```mermaid
graph TB
    %% User Interface Layer
    subgraph "User Interface"
        UI[Gradio Web Interface<br/>interface.py]
        REC[Audio Recording<br/>Microphone Input]
        UP[File Upload<br/>WAV/FLAC files]
    end

    %% Data Processing Layer
    subgraph "Data Processing"
        DP[Data Processing<br/>Audio resampling<br/>JSONL creation]
        DS[Dataset Management<br/>NVIDIA Granary<br/>Local datasets]
    end

    %% Training Layer
    subgraph "Training Pipeline"
        TF[Full Fine-tuning<br/>scripts/train.py]
        TL[LoRA Fine-tuning<br/>scripts/train_lora.py]
        TI[Trackio Integration<br/>Experiment Tracking]
    end

    %% Model Management Layer
    subgraph "Model Management"
        MM[Model Management<br/>Hugging Face Hub<br/>Local storage]
        MC[Model Card Generation<br/>scripts/generate_model_card.py]
    end

    %% Deployment Layer
    subgraph "Deployment & Demo"
        DEP[Demo Space Deployment<br/>scripts/deploy_demo_space.py]
        HF[HF Spaces<br/>Interactive Demo]
    end

    %% External Services
    subgraph "External Services"
        HFH["Hugging Face Hub<br/>Models & Datasets"]
        GRAN[NVIDIA Granary<br/>Multilingual ASR Dataset]
        TRACK[Trackio Spaces<br/>Experiment Tracking]
    end

    %% Data Flow
    UI --> DP
    REC --> DP
    UP --> DP
    DP --> DS

    DS --> TF
    DS --> TL
    TF --> TI
    TL --> TI

    TF --> MM
    TL --> MM
    MM --> MC

    MM --> DEP
    DEP --> HF

    DS -.-> HFH
    MM -.-> HFH
    TI -.-> TRACK
    DS -.-> GRAN

    %% Styling
    classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class UI,REC,UP interface
    class DP,DS processing
    class TF,TL,TI training
    class MM,MC management
    class DEP,HF deployment
    class HFH,GRAN,TRACK external
```

## Architecture Overview

This diagram shows the high-level architecture of the Voxtral ASR Fine-tuning application. The system is organized into several layers:

### 1. User Interface Layer
- **Gradio Web Interface**: Main user-facing application built with Gradio
- **Audio Recording**: Microphone input for recording speech samples
- **File Upload**: Support for uploading existing WAV/FLAC audio files

### 2. Data Processing Layer
- **Data Processing**: Audio resampling to 16 kHz, JSONL dataset creation
- **Dataset Management**: Integration with the NVIDIA Granary dataset and local dataset handling

### 3. Training Layer
- **Full Fine-tuning**: Complete model fine-tuning using `scripts/train.py`
- **LoRA Fine-tuning**: Parameter-efficient fine-tuning using `scripts/train_lora.py`
- **Trackio Integration**: Experiment tracking and logging

### 4. Model Management Layer
- **Model Management**: Local storage and Hugging Face Hub integration
- **Model Card Generation**: Automated model card creation

### 5. Deployment Layer
- **Demo Space Deployment**: Automated deployment to Hugging Face Spaces
- **Interactive Demo**: Live demo interface for testing fine-tuned models

### 6. External Services
- **Hugging Face Hub**: Model and dataset storage and sharing
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **Trackio Spaces**: Experiment tracking and visualization

## Key Workflows

1. **Dataset Creation**: Users record audio or upload files, which are processed into JSONL format
2. **Model Training**: Datasets are fed into the training scripts with experiment tracking
3. **Model Publishing**: Trained models are pushed to the HF Hub with generated model cards
4. **Demo Deployment**: Interactive demos are deployed automatically to HF Spaces (see the sketch below)
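
Run end to end, workflows 2-4 reduce to the scripts named above. The driver below is hypothetical: the scripts' command-line flags are not documented here, so this assumes they pick up their configuration from the environment (`HF_TOKEN`, `HF_USERNAME`) and their defaults:

```python
import subprocess

# Hypothetical end-to-end run of workflows 2-4 (workflow 1, dataset creation,
# happens interactively in interface.py).
for script in (
    "scripts/train_lora.py",           # 2. model training
    "scripts/push_to_huggingface.py",  # 3. model publishing
    "scripts/deploy_demo_space.py",    # 4. demo deployment
):
    subprocess.run(["python", script], check=True)
```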

See also:
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
- [Deployment Pipeline](deployment-pipeline.md)
- [Data Flow](data-flow.md)

docs/data-flow.md ADDED

# Data Flow

```mermaid
flowchart TD
    %% User Input Sources
    subgraph "User Input"
        MIC[🎤 Microphone Recording<br/>Raw audio + timestamps]
        FILE[📁 File Upload<br/>WAV/FLAC files]
        TEXT[📝 Manual Transcripts<br/>Text input]
        LANG[🌍 Language Selection<br/>25+ languages]
    end

    %% Data Processing Pipeline
    subgraph "Data Processing"
        AUDIO_PROC[Audio Processing<br/>Resampling to 16kHz<br/>Format conversion]
        TEXT_PROC["Text Processing<br/>Transcript validation<br/>Cleaning & formatting"]
        JSONL_CONV["JSONL Conversion<br/>{'audio_path': '...', 'text': '...'}"]
    end

    %% Dataset Storage
    subgraph "Dataset Storage"
        LOCAL_DS[Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/]
        HF_DS[HF Hub Dataset<br/>username/dataset-name<br/>Public sharing]
    end

    %% Training Data Flow
    subgraph "Training Data Pipeline"
        DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
        AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
        TRAIN_SPLIT[Train Split<br/>train_dataset]
        EVAL_SPLIT[Eval Split<br/>eval_dataset]
    end

    %% Model Training
    subgraph "Model Training"
        COLLATOR[VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction]
        FORWARD[Forward Pass<br/>Audio → Features → Text]
        LOSS[Loss Calculation<br/>Masked LM loss]
        BACKWARD[Backward Pass<br/>Gradient computation]
        OPTIMIZE[Parameter Updates<br/>LoRA or full fine-tuning]
    end

    %% Training Outputs
    subgraph "Training Outputs"
        MODEL_FILES[Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json]
        TRAINING_LOGS[Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves]
        CHECKPOINTS[Checkpoints<br/>Intermediate models<br/>best model tracking]
    end

    %% Publishing Pipeline
    subgraph "Publishing Pipeline"
        HF_REPO[HF Repository<br/>username/model-name<br/>Model hosting]
        MODEL_CARD[Model Card<br/>README.md<br/>Training details<br/>Usage examples]
        METADATA[Training Metadata<br/>Config + results<br/>Performance metrics]
    end

    %% Demo Deployment
    subgraph "Demo Deployment"
        SPACE_REPO[HF Space Repository<br/>username/model-name-demo<br/>Demo hosting]
        DEMO_APP[Demo Application<br/>Gradio interface<br/>Real-time inference]
        ENV_VARS[Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets]
    end

    %% External Data Sources
    subgraph "External Data Sources"
        GRANARY[NVIDIA Granary<br/>Multilingual ASR data<br/>25+ languages]
        HF_COMM[HF Community Datasets<br/>Public ASR datasets<br/>Standard formats]
    end

    %% Data Flow Connections
    MIC --> AUDIO_PROC
    FILE --> AUDIO_PROC
    TEXT --> TEXT_PROC
    LANG --> TEXT_PROC

    AUDIO_PROC --> JSONL_CONV
    TEXT_PROC --> JSONL_CONV

    JSONL_CONV --> LOCAL_DS
    LOCAL_DS --> HF_DS

    LOCAL_DS --> DS_LOADER
    HF_DS --> DS_LOADER
    GRANARY --> DS_LOADER
    HF_COMM --> DS_LOADER

    DS_LOADER --> AUDIO_CAST
    AUDIO_CAST --> TRAIN_SPLIT
    AUDIO_CAST --> EVAL_SPLIT

    TRAIN_SPLIT --> COLLATOR
    EVAL_SPLIT --> COLLATOR

    COLLATOR --> FORWARD
    FORWARD --> LOSS
    LOSS --> BACKWARD
    BACKWARD --> OPTIMIZE

    OPTIMIZE --> MODEL_FILES
    OPTIMIZE --> TRAINING_LOGS
    OPTIMIZE --> CHECKPOINTS

    MODEL_FILES --> HF_REPO
    TRAINING_LOGS --> HF_REPO
    CHECKPOINTS --> HF_REPO

    HF_REPO --> MODEL_CARD
    TRAINING_LOGS --> MODEL_CARD

    MODEL_CARD --> SPACE_REPO
    HF_REPO --> SPACE_REPO
    ENV_VARS --> SPACE_REPO

    SPACE_REPO --> DEMO_APP

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
    classDef external fill:#efebe9,stroke:#5d4037,stroke-width:2px

    class MIC,FILE,TEXT,LANG input
    class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
    class LOCAL_DS,HF_DS storage
    class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
    class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
    class HF_REPO,MODEL_CARD,METADATA publishing
    class SPACE_REPO,DEMO_APP,ENV_VARS deployment
    class GRANARY,HF_COMM external
```

## Data Flow Overview

This diagram illustrates the complete data flow through the Voxtral ASR Fine-tuning application, from user input to deployed demo.

### Data Input Sources

#### User-Generated Data
- **Microphone Recording**: Raw audio captured through the browser microphone
- **File Upload**: Existing WAV/FLAC audio files
- **Manual Transcripts**: User-provided text transcriptions
- **Language Selection**: Influences phrase selection from NVIDIA Granary

#### External Data Sources
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **HF Community Datasets**: Public datasets from the Hugging Face Hub

### Data Processing Pipeline

#### Audio Processing
```python
import librosa
import soundfile as sf

# Audio resampling and format conversion (librosa returns samples plus rate)
audio_data, sr = librosa.load(audio_path, sr=16000)
# Convert to WAV format for consistency
sf.write(output_path, audio_data, 16000)
```

#### Text Processing
```python
# Text cleaning and validation
text = text.strip()
# Basic validation (length, content checks)
assert len(text) > 0, "Empty transcription"
```

#### JSONL Conversion
```python
import json

# Standard format for all datasets
entry = {
    "audio_path": str(audio_file_path),
    "text": cleaned_transcription,
}
# Write to JSONL file
with open(jsonl_path, "a") as f:
    f.write(json.dumps(entry) + "\n")
```

### Dataset Storage

#### Local Storage Structure
```
datasets/voxtral_user/
├── data.jsonl            # Main dataset file
├── recorded_data.jsonl   # From recordings
└── wavs/                 # Audio files
    ├── recording_0000.wav
    ├── recording_0001.wav
    └── ...
```

#### HF Hub Storage
- **Public Datasets**: Shareable with the community
- **Version Control**: Dataset versioning and updates
- **Standard Metadata**: Automatic README generation

### Training Data Pipeline

#### Dataset Loading
```python
from datasets import load_dataset

# Load local JSONL (helper defined in the training scripts)
ds = _load_jsonl_dataset("datasets/voxtral_user/data.jsonl")

# Load HF dataset
ds = load_dataset("username/dataset-name", split="train")
```

#### Audio Casting
```python
from datasets import Audio

# Ensure a consistent sampling rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```

#### Train/Eval Split
```python
# Create train and eval datasets
train_dataset = ds.select(range(train_count))
eval_dataset = ds.select(range(train_count, train_count + eval_count))
```

### Training Process Flow

#### Data Collation
- **VoxtralDataCollator**: Custom collator for the Voxtral model
- **Audio Processing**: Convert audio to model inputs
- **Prompt Construction**: Build `[AUDIO]...[AUDIO] <transcribe>` prompts
- **Text Tokenization**: Process transcription targets
- **Masking**: Mask audio prompt tokens during training (see the sketch below)
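
The collator's key trick is the label mask: audio/prompt positions are set to `-100` so the cross-entropy loss covers only the transcription tokens. A minimal, self-contained sketch of that idea (a hypothetical helper, not the actual `VoxtralDataCollator`):

```python
import torch

def build_masked_labels(prompt_ids, target_ids, pad_id, max_len):
    """Concatenate prompt + target ids and mask everything but the target."""
    input_ids = (list(prompt_ids) + list(target_ids))[:max_len]
    # -100 is the ignore_index of PyTorch's cross-entropy loss, so masked
    # positions (the audio prompt) contribute nothing to the training loss.
    labels = ([-100] * len(prompt_ids) + list(target_ids))[:max_len]
    padding = max_len - len(input_ids)
    input_ids += [pad_id] * padding
    labels += [-100] * padding
    return torch.tensor(input_ids), torch.tensor(labels)
```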

#### Forward Pass
1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Extract audio features
3. **Language Model**: Generate the transcription autoregressively
4. **Loss Calculation**: Compare generated vs. target text

#### Backward Pass & Optimization
- **Gradient Computation**: Backpropagation
- **LoRA Updates**: Update only adapter parameters (LoRA mode)
- **Full Updates**: Update all parameters (full fine-tuning)
- **Optimizer Step**: Apply gradients with learning-rate scheduling

### Training Outputs

#### Model Files
- **model.safetensors**: Model weights (safetensors format)
- **config.json**: Model configuration
- **tokenizer.json**: Tokenizer configuration
- **generation_config.json**: Generation parameters

#### Training Logs
- **train_results.json**: Final training metrics
- **eval_results.json**: Evaluation results
- **training_config.json**: Training hyperparameters
- **trainer_state.json**: Training state and checkpoints

#### Checkpoints
- **checkpoint-XXX/**: Intermediate model snapshots
- **best-model/**: Best performing model
- **final-model/**: Final trained model

### Publishing Pipeline

#### HF Repository Structure
```
username/model-name/
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── config.json
├── tokenizer.json
├── training_config.json
├── train_results.json
├── README.md (model card)
└── training_results/
    └── training.log
```

#### Model Card Generation
- **Template Processing**: Fill the model_card.md template
- **Variable Injection**: Training config, results, metadata
- **Conditional Sections**: Handle quantized models, etc.

### Demo Deployment

#### Space Repository Structure
```
username/model-name-demo/
├── app.py            # Gradio demo application
├── requirements.txt  # Python dependencies
├── README.md         # Space documentation
└── .env              # Environment variables
```

#### Environment Configuration
```bash
# Space environment variables
HF_MODEL_ID=username/model-name
MODEL_NAME=Voxtral Fine-tuned Model
HF_TOKEN=read_only_token  # For model access
BRAND_OWNER_NAME=username
# ... other branding variables
```

### Data Flow Patterns

#### Streaming vs Batch Processing
- **Training Data**: Batch processing for efficiency
- **External Datasets**: Streaming loading for memory efficiency (see the sketch below)
- **User Input**: Real-time processing with immediate feedback
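
For external sources the loader can stream instead of downloading the full corpus. A hedged sketch with the 🤗 `datasets` streaming mode (the Granary config and column names are assumptions):

```python
from datasets import load_dataset

# Stream the dataset lazily; rows are fetched on iteration, not downloaded up front
stream = load_dataset("nvidia/Granary", split="train", streaming=True)
for i, example in enumerate(stream):
    print(sorted(example.keys()))
    if i >= 2:  # peek at a few rows only
        break
```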

#### Data Validation
- **Input Validation**: Check audio format, sampling rate, text length
- **Quality Assurance**: Filter out empty or invalid entries
- **Consistency Checks**: Ensure audio-text alignment

#### Error Handling
- **Graceful Degradation**: Fall back to local data if external sources fail
- **Retry Logic**: Automatic retry for network failures
- **Logging**: Comprehensive error logging and debugging

### Performance Considerations

#### Memory Management
- **Streaming Loading**: Process large datasets without loading everything
- **Audio Caching**: Cache processed audio features
- **Batch Optimization**: Balance batch size with available memory

#### Storage Optimization
- **Compression**: Use efficient audio formats
- **Deduplication**: Avoid duplicate data entries
- **Cleanup**: Remove temporary files after processing

#### Network Efficiency
- **Incremental Uploads**: Upload files as they're ready
- **Resume Capability**: Resume interrupted uploads
- **Caching**: Cache frequently accessed data

### Security & Privacy

#### Data Privacy
- **Local Processing**: Audio files processed locally when possible
- **User Consent**: Clear data usage policies
- **Anonymization**: Remove personally identifiable information

#### Access Control
- **Token Management**: Secure HF token storage
- **Repository Permissions**: Appropriate public/private settings
- **Rate Limiting**: Prevent abuse of demo interfaces

### Monitoring & Analytics

#### Data Quality Metrics
- **Audio Quality**: Sampling rate, format validation
- **Text Quality**: Length, language detection, consistency
- **Dataset Statistics**: Size, distribution, coverage

#### Performance Metrics
- **Processing Time**: Data loading, preprocessing, training time
- **Model Metrics**: Loss, perplexity, WER (if available)
- **Resource Usage**: Memory, CPU/GPU utilization

#### User Analytics
- **Usage Patterns**: Popular languages, dataset sizes
- **Success Rates**: Training completion, deployment success
- **Error Patterns**: Common failure modes and solutions

See also:
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)

docs/deployment-pipeline.md ADDED

# Deployment Pipeline

```mermaid
graph TB
    %% Input Sources
    subgraph "Inputs"
        TRAINED_MODEL[Trained Model<br/>Local directory]
        TRAINING_CONFIG[Training Config<br/>JSON/YAML]
        TRAINING_RESULTS["Training Results<br/>Metrics & logs"]
        MODEL_METADATA["Model Metadata<br/>Name, description, etc."]
    end

    %% Model Publishing
    subgraph "Model Publishing"
        PUSH_SCRIPT[push_to_huggingface.py<br/>Model Publisher]

        subgraph "Publishing Steps"
            REPO_CREATION[Repository Creation<br/>HF Hub API]
            FILE_UPLOAD[File Upload<br/>Model files to HF]
            METADATA_UPLOAD["Metadata Upload<br/>Config & results"]
        end
    end

    %% Model Card Generation
    subgraph "Model Card Generation"
        CARD_SCRIPT[generate_model_card.py<br/>Card Generator]

        subgraph "Card Components"
            TEMPLATE_LOAD[Template Loading<br/>model_card.md]
            VARIABLE_REPLACEMENT[Variable Replacement<br/>Config injection]
            CONDITIONAL_PROCESSING["Conditional Sections<br/>Quantized models, etc."]
        end
    end

    %% Demo Space Deployment
    subgraph "Demo Space Deployment"
        DEPLOY_SCRIPT[deploy_demo_space.py<br/>Space Deployer]

        subgraph "Space Setup"
            SPACE_CREATION[Space Repository<br/>Create HF Space]
            TEMPLATE_COPY[Template Copying<br/>demo_voxtral/ files]
            ENV_INJECTION[Environment Setup<br/>Model config injection]
            SECRET_SETUP["Secret Configuration<br/>HF_TOKEN, model vars"]
        end
    end

    %% Space Building & Testing
    subgraph "Space Building"
        BUILD_TRIGGER[Build Trigger<br/>Automatic build start]
        DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
        MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
        APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
    end

    %% Live Demo
    subgraph "Live Demo Space"
        GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
        MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
        USER_INTERACTION[User Interaction<br/>Audio upload/playback]
    end

    %% External Services
    subgraph "External Services"
        HF_HUB["Hugging Face Hub<br/>Model & Space hosting"]
        HF_SPACES[HF Spaces Platform<br/>Demo hosting]
    end

    %% Flow Connections
    TRAINED_MODEL --> PUSH_SCRIPT
    TRAINING_CONFIG --> PUSH_SCRIPT
    TRAINING_RESULTS --> PUSH_SCRIPT
    MODEL_METADATA --> PUSH_SCRIPT

    PUSH_SCRIPT --> REPO_CREATION
    REPO_CREATION --> FILE_UPLOAD
    FILE_UPLOAD --> METADATA_UPLOAD

    METADATA_UPLOAD --> CARD_SCRIPT
    TRAINING_CONFIG --> CARD_SCRIPT
    TRAINING_RESULTS --> CARD_SCRIPT

    CARD_SCRIPT --> TEMPLATE_LOAD
    TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
    VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING

    CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
    METADATA_UPLOAD --> DEPLOY_SCRIPT

    DEPLOY_SCRIPT --> SPACE_CREATION
    SPACE_CREATION --> TEMPLATE_COPY
    TEMPLATE_COPY --> ENV_INJECTION
    ENV_INJECTION --> SECRET_SETUP

    SECRET_SETUP --> BUILD_TRIGGER
    BUILD_TRIGGER --> DEPENDENCY_INSTALL
    DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
    MODEL_DOWNLOAD --> APP_INITIALIZATION

    APP_INITIALIZATION --> GRADIO_INTERFACE
    GRADIO_INTERFACE --> MODEL_INFERENCE
    MODEL_INFERENCE --> USER_INTERACTION

    HF_HUB --> MODEL_DOWNLOAD
    HF_SPACES --> GRADIO_INTERFACE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
    class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
    class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
    class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
    class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
    class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
    class HF_HUB,HF_SPACES external
```

## Deployment Pipeline Overview

This diagram illustrates the complete deployment pipeline that takes a trained Voxtral model and makes it available as an interactive demo on Hugging Face Spaces.

### Input Sources

#### Trained Model Artifacts
- **Model Files**: `model.safetensors`, `config.json`, `tokenizer.json`
- **Training Config**: Hyperparameters and training setup
- **Training Results**: Metrics, loss curves, evaluation results
- **Model Metadata**: Name, description, base model information

### Model Publishing Phase

#### push_to_huggingface.py Script
```python
# Initialize publisher
pusher = HuggingFacePusher(
    model_path=output_dir,
    repo_name=repo_name,
    token=hf_token,
)

# Push model
success = pusher.push_model(training_config, results)
```

#### Publishing Steps
1. **Repository Creation**: Create the HF Hub repository
2. **File Upload**: Upload all model files
3. **Metadata Upload**: Upload training config and results (see the sketch below)
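
Under the hood these steps correspond to standard `huggingface_hub` calls. A hedged sketch — the script wraps equivalent logic, and `hf_token`/`output_dir` are stand-ins:

```python
from huggingface_hub import HfApi

api = HfApi(token=hf_token)
# 1. Repository creation
api.create_repo(repo_id="username/model-name", exist_ok=True)
# 2. File upload: push the whole local model directory
api.upload_folder(repo_id="username/model-name", folder_path=output_dir)
# 3. Metadata upload
api.upload_file(
    repo_id="username/model-name",
    path_or_fileobj="training_config.json",
    path_in_repo="training_config.json",
)
```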

### Model Card Generation

#### generate_model_card.py Script
```python
# Create generator
generator = ModelCardGenerator()

# Generate card
variables = {
    "model_name": model_name,
    "repo_name": repo_id,
    "base_model": base_model,
    # ... other variables
}
content = generator.generate_model_card(variables)
```

#### Card Processing
1. **Template Loading**: Load from `templates/model_card.md`
2. **Variable Replacement**: Inject actual values (see the sketch below)
3. **Conditional Processing**: Handle optional sections
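
Variable replacement itself can be as simple as template substitution. This illustration uses `string.Template`, while the actual scheme in `scripts/generate_model_card.py` may differ:

```python
from string import Template

template = Template("# $model_name\n\nFine-tuned from `$base_model`.")
card = template.safe_substitute(
    model_name="Voxtral Fine-tuned Model",
    base_model="mistralai/Voxtral-Mini-3B-2507",
)
print(card)
```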

### Demo Space Deployment

#### deploy_demo_space.py Script
```python
# Initialize deployer
deployer = DemoSpaceDeployer(
    hf_token=token,
    hf_username=username,
    model_id=model_id,
    demo_type="voxtral",
)

# Deploy space
success = deployer.deploy()
```

#### Space Setup Process
1. **Space Creation**: Create the HF Space repository
2. **Template Copying**: Copy the demo template files
3. **Environment Injection**: Set model-specific variables
4. **Secret Configuration**: Configure HF_TOKEN and model variables (see the sketch below)
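
Steps 1 and 4 map onto public `huggingface_hub` APIs. A hedged sketch — repo ids and token variables are stand-ins:

```python
from huggingface_hub import HfApi

api = HfApi(token=hf_token)
# 1. Create the Space repository with the Gradio SDK
api.create_repo(repo_id="username/model-name-demo", repo_type="space",
                space_sdk="gradio", exist_ok=True)
# 4. Configure secrets and model variables for the running Space
api.add_space_secret(repo_id="username/model-name-demo",
                     key="HF_TOKEN", value=read_only_token)
api.add_space_variable(repo_id="username/model-name-demo",
                       key="HF_MODEL_ID", value="username/model-name")
```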

### Space Building Process

#### Automatic Build Trigger
- **Dependency Installation**: `pip install -r requirements.txt`
- **Model Download**: Download the model from the HF Hub
- **App Initialization**: Set up the Gradio application

#### Demo Template Structure
```
templates/spaces/demo_voxtral/
├── app.py            # Main Gradio application
├── requirements.txt  # Python dependencies
└── README.md         # Space documentation
```

### Live Demo Features

#### Gradio Interface
- **Audio Upload**: File upload or recording
- **Real-time Inference**: Live ASR transcription
- **Interactive Controls**: Model parameters, settings

#### Model Inference Pipeline
- **Audio Processing**: Convert to model inputs
- **Transcription Generation**: Run ASR inference
- **Result Display**: Show the transcription with confidence

### Configuration Management

#### Environment Variables
```python
import os

# Set in Space secrets/environment
os.environ['HF_MODEL_ID'] = model_id
os.environ['MODEL_NAME'] = model_name
os.environ['HF_TOKEN'] = token  # For model access
```

#### Demo-Specific Settings
- **Model Configuration**: Base model, subfolder, quantization
- **UI Branding**: Custom titles, descriptions, links
- **Example Prompts**: Pre-configured demo examples

### Error Handling & Monitoring

#### Build Process Monitoring
- **Build Logs**: Real-time build status
- **Error Detection**: Failed dependency installation
- **Retry Logic**: Automatic rebuild on failure

#### Runtime Monitoring
- **Space Health**: Uptime and responsiveness
- **Model Loading**: Successful model initialization
- **Inference Errors**: Runtime error handling

### Security Considerations

#### Token Management
- **Read-Only Tokens**: Use read-only tokens for demo spaces
- **Secret Storage**: Secure storage of HF_TOKEN
- **Access Control**: Proper repository permissions

#### Resource Management
- **Memory Limits**: Space hardware constraints
- **Timeout Handling**: Inference timeout protection
- **Rate Limiting**: Prevent abuse

### Integration Points

#### With Training Scripts
- **Training Config**: Used for model card generation
- **Training Results**: Included in model metadata
- **Model Path**: Direct path to trained model files

#### With Interface (interface.py)
- **Parameter Passing**: Deployment settings from the UI
- **Progress Updates**: Deployment progress shown to the user
- **Result Links**: Direct links to deployed spaces

### Deployment Workflows

#### Full Pipeline (Recommended)
1. Train model → Generate model card → Push to Hub → Deploy demo
2. All steps automated through a single interface action
3. Comprehensive error handling and rollback

#### Manual Deployment
1. Use individual scripts for granular control
2. Custom configuration and branding
3. Debugging and troubleshooting capabilities

#### CI/CD Integration
- **Automated Triggers**: GitHub Actions integration
- **Version Control**: Model versioning and releases
- **Testing**: Automated demo testing

### Performance Optimization

#### Space Hardware Selection
- **CPU Basic**: Free tier, sufficient for small models
- **GPU Options**: For larger models requiring acceleration
- **Memory Scaling**: Based on model size requirements

#### Model Optimization
- **Quantization**: 4-bit quantization for a smaller footprint (see the sketch below)
- **Model Sharding**: Split large models across memory
- **Caching**: Model caching for faster cold starts
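
4-bit loading is typically done through `bitsandbytes` via `transformers`; whether the demo space loads Voxtral exactly this way is an assumption:

```python
import torch
from transformers import BitsAndBytesConfig, VoxtralForConditionalGeneration

# Quantize weights to 4 bits at load time to shrink the memory footprint
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.float16)
model = VoxtralForConditionalGeneration.from_pretrained(
    "username/model-name",
    quantization_config=bnb_config,
    device_map="auto",
)
```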

### Monitoring & Analytics

#### Space Analytics
- **Usage Metrics**: Daily active users, session duration
- **Performance Metrics**: Inference latency, error rates
- **User Feedback**: Demo effectiveness and issues

#### Model Analytics
- **Download Stats**: Model popularity and usage
- **Citation Tracking**: Academic and research usage
- **Community Feedback**: GitHub issues and discussions

See also:
- [Architecture Overview](architecture.md)
- [Training Pipeline](training-pipeline.md)
- [Data Flow](data-flow.md)

docs/diagrams.html ADDED

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Voxtral ASR Fine-tuning - Architecture Diagrams</title>
    <script type="module">
        import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
        mermaid.initialize({
            startOnLoad: true,
            theme: 'base',
            themeVariables: {
                primaryColor: '#e3f2fd',
                primaryTextColor: '#1976d2',
                primaryBorderColor: '#01579b',
                lineColor: '#424242',
                secondaryColor: '#fff3e0',
                tertiaryColor: '#fce4ec',
                background: '#ffffff',
                mainBkg: '#ffffff',
                secondBkg: '#f5f5f5',
                textColor: '#333333'
            },
            flowchart: {
                useMaxWidth: true,
                htmlLabels: true,
                curve: 'basis'
            },
            sequence: {
                useMaxWidth: true
            }
        });
    </script>
    <style>
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            line-height: 1.6;
            color: #333;
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
            background: #f8f9fa;
        }

        .header {
            text-align: center;
            margin-bottom: 40px;
            padding: 20px;
            background: white;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }

        .diagram-container {
            background: white;
            margin: 20px 0;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }

        .diagram-title {
            font-size: 1.5em;
            font-weight: bold;
            margin-bottom: 15px;
            color: #1976d2;
            border-bottom: 2px solid #e3f2fd;
            padding-bottom: 10px;
        }

        .diagram-description {
            margin-bottom: 20px;
            color: #666;
            font-style: italic;
        }

        .navigation {
            position: fixed;
            top: 20px;
            right: 20px;
            background: white;
            padding: 15px;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
            max-width: 200px;
        }

        .nav-link {
            display: block;
            padding: 8px 0;
            color: #1976d2;
            text-decoration: none;
            border-bottom: 1px solid #eee;
        }

        .nav-link:hover {
            color: #01579b;
            text-decoration: underline;
        }

        .nav-link:last-child {
            border-bottom: none;
        }

        .code-toggle {
            background: #f5f5f5;
            border: 1px solid #ddd;
            padding: 10px;
            margin: 10px 0;
            border-radius: 4px;
            cursor: pointer;
            font-size: 0.9em;
        }

        .mermaid-code {
            display: none;
            background: #f8f9fa;
            border: 1px solid #dee2e6;
            border-radius: 4px;
            padding: 15px;
            margin: 10px 0;
            font-family: 'Courier New', monospace;
            font-size: 0.85em;
            white-space: pre-wrap;
            overflow-x: auto;
        }

        .download-btn {
            background: #1976d2;
            color: white;
            border: none;
            padding: 8px 16px;
            border-radius: 4px;
            cursor: pointer;
            font-size: 0.9em;
            margin: 10px 5px 10px 0;
        }

        .download-btn:hover {
            background: #01579b;
        }

        @media print {
            .navigation, .code-toggle, .download-btn {
                display: none;
            }
            .diagram-container {
                break-inside: avoid;
                margin: 10px 0;
            }
        }
    </style>
</head>
<body>
    <div class="header">
        <h1>🎯 Voxtral ASR Fine-tuning</h1>
        <h2>Architecture &amp; Workflow Diagrams</h2>
        <p>Interactive documentation with Mermaid diagrams</p>
    </div>

    <nav class="navigation">
        <strong>Quick Navigation</strong>
        <a href="#overview" class="nav-link">Overview</a>
        <a href="#architecture" class="nav-link">Architecture</a>
        <a href="#interface" class="nav-link">Interface Workflow</a>
        <a href="#training" class="nav-link">Training Pipeline</a>
        <a href="#deployment" class="nav-link">Deployment Pipeline</a>
        <a href="#dataflow" class="nav-link">Data Flow</a>
    </nav>

    <div id="overview" class="diagram-container">
        <div class="diagram-title">📋 Documentation Overview</div>
        <div class="diagram-description">
            High-level overview of the Voxtral ASR Fine-tuning application and its documentation structure.
        </div>
        <div class="mermaid">
graph TD
    START(["Voxtral ASR Fine-tuning App"]) --> OVERVIEW{Choose Documentation}

    OVERVIEW --> ARCH["Architecture Overview"]
    OVERVIEW --> WORKFLOW["Interface Workflow"]
    OVERVIEW --> TRAINING["Training Pipeline"]
    OVERVIEW --> DEPLOYMENT["Deployment Pipeline"]
    OVERVIEW --> DATAFLOW["Data Flow"]

    ARCH --> ARCH_DIAG["High-level Architecture<br/>System Components & Layers"]
    WORKFLOW --> WORKFLOW_DIAG["User Journey<br/>Recording → Training → Demo"]
    TRAINING --> TRAINING_DIAG["Training Scripts<br/>Data → Model → Results"]
    DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo<br/>Model → Hub → Space"]
    DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey<br/>Input → Processing → Output"]

    subgraph "Core Components"
        INTERFACE["interface.py<br/>Gradio Web UI"]
        TRAIN_SCRIPTS["scripts/train*.py<br/>Training Scripts"]
        DEPLOY_SCRIPT["scripts/deploy_demo_space.py<br/>Demo Deployment"]
        PUSH_SCRIPT["scripts/push_to_huggingface.py<br/>Model Publishing"]
    end

    subgraph "Key Data Formats"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        HFDATA["HF Hub Models<br/>username/model-name"]
        SPACES["HF Spaces<br/>Interactive Demos"]
    end

    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT

    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
        </div>
    </div>

    <div id="architecture" class="diagram-container">
        <div class="diagram-title">System Architecture</div>
        <div class="diagram-description">
            High-level architecture showing the main components and their relationships in the Voxtral ASR Fine-tuning application.
        </div>
        <div class="mermaid">
graph TB
    subgraph "User Interface"
        UI["Gradio Web Interface<br/>interface.py"]
        REC["Audio Recording<br/>Microphone Input"]
        UP["File Upload<br/>WAV/FLAC files"]
    end

    subgraph "Data Processing"
        DP["Data Processing<br/>Audio resampling<br/>JSONL creation"]
        DS["Dataset Management<br/>NVIDIA Granary<br/>Local datasets"]
    end

    subgraph "Training Pipeline"
        TF["Full Fine-tuning<br/>scripts/train.py"]
        TL["LoRA Fine-tuning<br/>scripts/train_lora.py"]
        TI["Trackio Integration<br/>Experiment Tracking"]
    end

    subgraph "Model Management"
        MM["Model Management<br/>Hugging Face Hub<br/>Local storage"]
        MC["Model Card Generation<br/>scripts/generate_model_card.py"]
    end

    subgraph "Deployment &amp; Demo"
        DEP["Demo Space Deployment<br/>scripts/deploy_demo_space.py"]
        HF["HF Spaces<br/>Interactive Demo"]
    end

    subgraph "External Services"
        HFH["Hugging Face Hub<br/>Models & Datasets"]
        GRAN["NVIDIA Granary<br/>Multilingual ASR Dataset"]
        TRACK["Trackio Spaces<br/>Experiment Tracking"]
    end

    UI --> DP
    REC --> DP
    UP --> DP
    DP --> DS

    DS --> TF
    DS --> TL
    TF --> TI
    TL --> TI

    TF --> MM
    TL --> MM
    MM --> MC

    MM --> DEP
    DEP --> HF

    DS -.-> HFH
    MM -.-> HFH
    TI -.-> TRACK
    DS -.-> GRAN

    classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class UI,REC,UP interface
    class DP,DS processing
    class TF,TL,TI training
    class MM,MC management
    class DEP,HF deployment
    class HFH,GRAN,TRACK external
        </div>
    </div>

    <div id="interface" class="diagram-container">
        <div class="diagram-title">Interface Workflow</div>
        <div class="diagram-description">
            Complete user journey through the Voxtral ASR Fine-tuning interface, from language selection to demo deployment.
        </div>
        <div class="mermaid">
flowchart TD
    START(["User Opens Interface"]) --> LANG["Language Selection<br/>Choose from 25+ languages"]
    LANG --> PHRASES["Load Phrases<br/>From NVIDIA Granary"]
    PHRASES --> RECORD["Recording Interface<br/>Display phrases + audio recording"]

    RECORD --> |User Records| PROCESS_REC["Process Recordings<br/>Save WAV files + transcripts"]
    RECORD --> |Upload Files| PROCESS_UPLOAD["Process Uploads<br/>Handle existing files + transcripts"]

    PROCESS_REC --> JSONL["Create JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
    PROCESS_UPLOAD --> JSONL

    JSONL --> CONFIG["Training Configuration<br/>Model, LoRA/full, hyperparameters"]
    CONFIG --> TRAIN["Training Process<br/>Execute train.py or train_lora.py"]

    TRAIN --> PUSH["Push to Hub<br/>Model + metadata to HF Hub"]
    TRAIN --> CARD["Generate Model Card<br/>Automated documentation"]
    PUSH --> DEPLOY["Deploy Demo Space<br/>Interactive demo on HF Spaces"]

    DEPLOY --> END(["Demo Ready<br/>Interactive ASR Demo"])

    PUSH -.-> END
    CARD -.-> END

    classDef start fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef process fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef decision fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef terminal fill:#e8f5e8,stroke:#388e3c,stroke-width:3px

    class START start
    class END terminal
    class LANG,PHRASES,RECORD,PROCESS_REC,PROCESS_UPLOAD,JSONL,CONFIG,TRAIN,PUSH,CARD,DEPLOY process
        </div>
    </div>

    <div id="training" class="diagram-container">
        <div class="diagram-title">Training Pipeline</div>
        <div class="diagram-description">
            Detailed training pipeline showing how data flows through training scripts and supporting infrastructure.
        </div>
        <div class="mermaid">
graph TB
    subgraph "Data Sources"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        GRANARY["NVIDIA Granary Dataset<br/>Multilingual ASR Data"]
        HFDATA["HF Hub Datasets<br/>Community Datasets"]
    end

    subgraph "Data Processing"
        LOADER["Dataset Loader<br/>_load_jsonl_dataset()"]
        CASTER["Audio Casting<br/>16kHz resampling"]
        COLLATOR["VoxtralDataCollator<br/>Audio + Text Processing"]
    end

    subgraph "Training Scripts"
        TRAIN_FULL["Full Fine-tuning<br/>scripts/train.py"]
        TRAIN_LORA["LoRA Fine-tuning<br/>scripts/train_lora.py"]

        subgraph "Training Components"
            MODEL_INIT["Model Initialization<br/>VoxtralForConditionalGeneration"]
            LORA_CONFIG["LoRA Configuration<br/>LoraConfig + get_peft_model"]
            PROCESSOR_INIT["Processor Initialization<br/>VoxtralProcessor"]
        end
    end

    subgraph "Training Infrastructure"
        TRACKIO_INIT["Trackio Integration<br/>Experiment Tracking"]
        HF_TRAINER["Hugging Face Trainer<br/>TrainingArguments + Trainer"]
        TORCH_DEVICE["Torch Device Setup<br/>GPU/CPU Detection"]
    end

    subgraph "Training Process"
        FORWARD_PASS["Forward Pass<br/>Audio Processing + Generation"]
        LOSS_CALC["Loss Calculation<br/>Masked Language Modeling"]
        BACKWARD_PASS["Backward Pass<br/>Gradient Computation"]
        OPTIMIZER_STEP["Optimizer Step<br/>Parameter Updates"]
        LOGGING["Metrics Logging<br/>Loss, Perplexity, etc."]
    end

    subgraph "Model Management"
        CHECKPOINT_SAVING["Checkpoint Saving<br/>Model snapshots"]
        MODEL_SAVING["Final Model Saving<br/>Processor + Model"]
        LOCAL_STORAGE["Local Storage<br/>outputs/ directory"]
    end

    LOADER --> CASTER
    CASTER --> COLLATOR

    COLLATOR --> TRAIN_FULL
    COLLATOR --> TRAIN_LORA

    TRAIN_FULL --> MODEL_INIT
    TRAIN_LORA --> MODEL_INIT
    TRAIN_LORA --> LORA_CONFIG

    MODEL_INIT --> PROCESSOR_INIT
    LORA_CONFIG --> PROCESSOR_INIT

    PROCESSOR_INIT --> TRACKIO_INIT
    PROCESSOR_INIT --> HF_TRAINER
    PROCESSOR_INIT --> TORCH_DEVICE

    TRACKIO_INIT --> HF_TRAINER
    TORCH_DEVICE --> HF_TRAINER

    HF_TRAINER --> FORWARD_PASS
    FORWARD_PASS --> LOSS_CALC
    LOSS_CALC --> BACKWARD_PASS
    BACKWARD_PASS --> OPTIMIZER_STEP
    OPTIMIZER_STEP --> LOGGING

    LOGGING --> CHECKPOINT_SAVING
    LOGGING --> TRACKIO_INIT

    HF_TRAINER --> MODEL_SAVING
    MODEL_SAVING --> LOCAL_STORAGE

    JSONL --> LOADER
    GRANARY --> LOADER
    HFDATA --> LOADER

    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class JSONL,GRANARY,HFDATA input
    class LOADER,CASTER,COLLATOR processing
    class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
    class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
    class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
    class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
        </div>
    </div>

    <div id="deployment" class="diagram-container">
        <div class="diagram-title">Deployment Pipeline</div>
        <div class="diagram-description">
            Model publishing and demo deployment process from trained model to live interactive demo.
        </div>
        <div class="mermaid">
graph TB
    subgraph "Inputs"
        TRAINED_MODEL["Trained Model<br/>Local directory"]
        TRAINING_CONFIG["Training Config<br/>JSON/YAML"]
        TRAINING_RESULTS["Training Results<br/>Metrics & logs"]
        MODEL_METADATA["Model Metadata<br/>Name, description, etc."]
    end

    subgraph "Model Publishing"
        PUSH_SCRIPT["push_to_huggingface.py<br/>Model Publisher"]

        subgraph "Publishing Steps"
            REPO_CREATION["Repository Creation<br/>HF Hub API"]
            FILE_UPLOAD["File Upload<br/>Model files to HF"]
            METADATA_UPLOAD["Metadata Upload<br/>Config & results"]
        end
    end

    subgraph "Model Card Generation"
        CARD_SCRIPT["generate_model_card.py<br/>Card Generator"]

        subgraph "Card Components"
            TEMPLATE_LOAD["Template Loading<br/>model_card.md"]
            VARIABLE_REPLACEMENT["Variable Replacement<br/>Config injection"]
            CONDITIONAL_PROCESSING["Conditional Sections<br/>Quantized models, etc."]
        end
    end

    subgraph "Demo Space Deployment"
        DEPLOY_SCRIPT["deploy_demo_space.py<br/>Space Deployer"]

        subgraph "Space Setup"
            SPACE_CREATION["Space Repository<br/>Create HF Space"]
            TEMPLATE_COPY["Template Copying<br/>demo_voxtral/ files"]
            ENV_INJECTION["Environment Setup<br/>Model config injection"]
            SECRET_SETUP["Secret Configuration<br/>HF_TOKEN, model vars"]
        end
    end

    subgraph "Space Building"
        BUILD_TRIGGER["Build Trigger<br/>Automatic build start"]
495
+ DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
496
+ MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
497
+ APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
498
+ end
499
+
500
+ subgraph "Live Demo Space"
501
+ GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
502
+ MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
503
+ USER_INTERACTION[User Interaction<br/>Audio upload/playback]
504
+ end
505
+
506
+ subgraph "External Services"
507
+ HF_HUB["Hugging Face Hub<br/>Model & Space hosting"]
508
+ HF_SPACES[HF Spaces Platform<br/>Demo hosting]
509
+ end
510
+
511
+ TRAINED_MODEL --> PUSH_SCRIPT
512
+ TRAINING_CONFIG --> PUSH_SCRIPT
513
+ TRAINING_RESULTS --> PUSH_SCRIPT
514
+ MODEL_METADATA --> PUSH_SCRIPT
515
+
516
+ PUSH_SCRIPT --> REPO_CREATION
517
+ REPO_CREATION --> FILE_UPLOAD
518
+ FILE_UPLOAD --> METADATA_UPLOAD
519
+
520
+ METADATA_UPLOAD --> CARD_SCRIPT
521
+ TRAINING_CONFIG --> CARD_SCRIPT
522
+ TRAINING_RESULTS --> CARD_SCRIPT
523
+
524
+ CARD_SCRIPT --> TEMPLATE_LOAD
525
+ TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
526
+ VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING
527
+
528
+ CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
529
+ METADATA_UPLOAD --> DEPLOY_SCRIPT
530
+
531
+ DEPLOY_SCRIPT --> SPACE_CREATION
532
+ SPACE_CREATION --> TEMPLATE_COPY
533
+ TEMPLATE_COPY --> ENV_INJECTION
534
+ ENV_INJECTION --> SECRET_SETUP
535
+
536
+ SECRET_SETUP --> BUILD_TRIGGER
537
+ BUILD_TRIGGER --> DEPENDENCY_INSTALL
538
+ DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
539
+ MODEL_DOWNLOAD --> APP_INITIALIZATION
540
+
541
+ APP_INITIALIZATION --> GRADIO_INTERFACE
542
+ GRADIO_INTERFACE --> MODEL_INFERENCE
543
+ MODEL_INFERENCE --> USER_INTERACTION
544
+
545
+ HF_HUB --> MODEL_DOWNLOAD
546
+ HF_SPACES --> GRADIO_INTERFACE
547
+
548
+ classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
549
+ classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
550
+ classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
551
+ classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
552
+ classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
553
+ classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
554
+ classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
555
+
556
+ class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
557
+ class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
558
+ class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
559
+ class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
560
+ class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
561
+ class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
562
+ class HF_HUB,HF_SPACES external
563
+ </div>
564
+ </div>
565
+
566
+ <div id="dataflow" class="diagram-container">
567
+ <div class="diagram-title">Data Flow</div>
568
+ <div class="diagram-description">
569
+ Complete data journey through the Voxtral ASR Fine-tuning application from user input to deployed demo.
570
+ </div>
571
+ <div class="mermaid">
572
+ flowchart TD
573
+ subgraph "User Input"
574
+ MIC["Microphone Recording<br/>Raw audio + timestamps"]
575
+ FILE["File Upload<br/>WAV/FLAC files"]
576
+ TEXT["Manual Transcripts<br/>Text input"]
577
+ LANG["Language Selection<br/>25+ languages"]
578
+ end
579
+
580
+ subgraph "Data Processing"
581
+ AUDIO_PROC["Audio Processing<br/>Resampling to 16kHz<br/>Format conversion"]
582
+ TEXT_PROC["Text Processing<br/>Transcript validation<br/>Cleaning & formatting"]
583
+ JSONL_CONV["JSONL Conversion<br/>{'audio_path': '...', 'text': '...'}"]
584
+ end
585
+
586
+ subgraph "Dataset Storage"
587
+ LOCAL_DS["Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/"]
588
+ HF_DS["HF Hub Dataset<br/>username/dataset-name<br/>Public sharing"]
589
+ end
590
+
591
+ subgraph "Training Data Pipeline"
592
+ DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
593
+ AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
594
+ TRAIN_SPLIT["Train Split<br/>train_dataset"]
595
+ EVAL_SPLIT["Eval Split<br/>eval_dataset"]
596
+ end
597
+
598
+ subgraph "Model Training"
599
+ COLLATOR["VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction"]
600
+ FORWARD["Forward Pass<br/>Audio β†’ Features β†’ Text"]
601
+ LOSS["Loss Calculation<br/>Masked LM loss"]
602
+ BACKWARD["Backward Pass<br/>Gradient computation"]
603
+ OPTIMIZE["Parameter Updates<br/>LoRA or full fine-tuning"]
604
+ end
605
+
606
+ subgraph "Training Outputs"
607
+ MODEL_FILES["Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json"]
608
+ TRAINING_LOGS["Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves"]
609
+ CHECKPOINTS["Checkpoints<br/>Intermediate models<br/>best model tracking"]
610
+ end
611
+
612
+ subgraph "Publishing Pipeline"
613
+ HF_REPO["HF Repository<br/>username/model-name<br/>Model hosting"]
614
+ MODEL_CARD["Model Card<br/>README.md<br/>Training details<br/>Usage examples"]
615
+ METADATA["Training Metadata<br/>Config + results<br/>Performance metrics"]
616
+ end
617
+
618
+ subgraph "Demo Deployment"
619
+ SPACE_REPO["HF Space Repository<br/>username/model-name-demo<br/>Demo hosting"]
620
+ DEMO_APP["Demo Application<br/>Gradio interface<br/>Real-time inference"]
621
+ ENV_VARS["Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets"]
622
+ end
623
+
624
+ MIC --> AUDIO_PROC
625
+ FILE --> AUDIO_PROC
626
+ TEXT --> TEXT_PROC
627
+ LANG --> TEXT_PROC
628
+
629
+ AUDIO_PROC --> JSONL_CONV
630
+ TEXT_PROC --> JSONL_CONV
631
+
632
+ JSONL_CONV --> LOCAL_DS
633
+ LOCAL_DS --> HF_DS
634
+
635
+ LOCAL_DS --> DS_LOADER
636
+ HF_DS --> DS_LOADER
637
+
638
+ DS_LOADER --> AUDIO_CAST
639
+ AUDIO_CAST --> TRAIN_SPLIT
640
+ AUDIO_CAST --> EVAL_SPLIT
641
+
642
+ TRAIN_SPLIT --> COLLATOR
643
+ EVAL_SPLIT --> COLLATOR
644
+
645
+ COLLATOR --> FORWARD
646
+ FORWARD --> LOSS
647
+ LOSS --> BACKWARD
648
+ BACKWARD --> OPTIMIZE
649
+
650
+ OPTIMIZE --> MODEL_FILES
651
+ OPTIMIZE --> TRAINING_LOGS
652
+ OPTIMIZE --> CHECKPOINTS
653
+
654
+ MODEL_FILES --> HF_REPO
655
+ TRAINING_LOGS --> HF_REPO
656
+ CHECKPOINTS --> HF_REPO
657
+
658
+ HF_REPO --> MODEL_CARD
659
+ TRAINING_LOGS --> MODEL_CARD
660
+
661
+ MODEL_CARD --> SPACE_REPO
662
+ HF_REPO --> SPACE_REPO
663
+ ENV_VARS --> SPACE_REPO
664
+
665
+ SPACE_REPO --> DEMO_APP
666
+
667
+ classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
668
+ classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
669
+ classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
670
+ classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
671
+ classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
672
+ classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
673
+ classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
674
+
675
+ class MIC,FILE,TEXT,LANG input
676
+ class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
677
+ class LOCAL_DS,HF_DS storage
678
+ class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
679
+ class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
680
+ class HF_REPO,MODEL_CARD,METADATA publishing
681
+ class SPACE_REPO,DEMO_APP,ENV_VARS deployment
682
+ </div>
683
+ </div>
684
+
685
+ <script>
686
+ // Toggle mermaid code visibility
687
+ function toggleCode(diagramId) {
688
+ const codeBlock = document.querySelector(`#${diagramId} .mermaid-code`);
689
+ if (codeBlock.style.display === 'none' || codeBlock.style.display === '') {
690
+ codeBlock.style.display = 'block';
691
+ } else {
692
+ codeBlock.style.display = 'none';
693
+ }
694
+ }
695
+
696
+ // Add toggle buttons to each diagram
697
+ document.addEventListener('DOMContentLoaded', function() {
698
+ const diagrams = document.querySelectorAll('.diagram-container');
699
+ diagrams.forEach((diagram, index) => {
700
+ const diagramId = diagram.id;
701
+ const mermaidDiv = diagram.querySelector('.mermaid');
702
+
703
+ if (mermaidDiv) {
704
+ // Create toggle button
705
+ const toggleBtn = document.createElement('button');
706
+ toggleBtn.className = 'code-toggle';
707
+ toggleBtn.textContent = '🔍 Show Mermaid Code';
708
+ toggleBtn.onclick = () => toggleCode(diagramId);
709
+
710
+ // Create code block
711
+ const codeBlock = document.createElement('pre');
712
+ codeBlock.className = 'mermaid-code';
713
+ codeBlock.textContent = mermaidDiv.textContent.trim();
714
+
715
+ // Insert elements
716
+ mermaidDiv.parentNode.insertBefore(toggleBtn, mermaidDiv);
717
+ mermaidDiv.parentNode.insertBefore(codeBlock, mermaidDiv.nextSibling);
718
+ }
719
+ });
720
+ });
721
+
722
+ // Print functionality
723
+ function printDiagrams() {
724
+ window.print();
725
+ }
726
+ </script>
727
+ </body>
728
+ </html>
docs/interface-workflow.md ADDED
@@ -0,0 +1,173 @@
1
+ # Interface Workflow
2
+
3
+ ```mermaid
4
+ stateDiagram-v2
5
+ [*] --> LangSetup: User opens interface
6
+
7
+ state "Language & Dataset Setup" as LangSetup {
8
+ [*] --> LanguageSelection
9
+ LanguageSelection --> LoadPhrases: Select language
10
+ LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
11
+ DisplayPhrases --> RecordingInterface: Show phrases & recording UI
12
+
13
+ state RecordingInterface {
14
+ [*] --> ShowInitialRows: Display first 10 phrases
15
+ ShowInitialRows --> RecordAudio: User can record audio
16
+ RecordAudio --> AddMoreRows: Optional - add 10 more rows
17
+ AddMoreRows --> RecordAudio
18
+ }
19
+ }
20
+
21
+ RecordingInterface --> DatasetCreation: User finishes recording
22
+
23
+ state "Dataset Creation Options" as DatasetCreation {
24
+ [*] --> FromRecordings: Create from recorded audio
25
+ [*] --> FromUploads: Upload existing files
26
+
27
+ FromRecordings --> ProcessRecordings: Save WAV files + transcripts
28
+ FromUploads --> ProcessUploads: Process uploaded files + transcripts
29
+
30
+ ProcessRecordings --> CreateJSONL: Generate JSONL dataset
31
+ ProcessUploads --> CreateJSONL
32
+
33
+ CreateJSONL --> DatasetReady: Dataset saved locally
34
+ }
35
+
36
+ DatasetCreation --> TrainingConfiguration: Dataset ready
37
+
38
+ state "Training Setup" as TrainingConfiguration {
39
+ [*] --> BasicSettings: Model, LoRA/full, batch size
40
+ [*] --> AdvancedSettings: Learning rate, epochs, LoRA params
41
+
42
+ BasicSettings --> ConfigureDeployment: Repo name, push options
43
+ AdvancedSettings --> ConfigureDeployment
44
+
45
+ ConfigureDeployment --> StartTraining: All settings configured
46
+ }
47
+
48
+ TrainingConfiguration --> TrainingProcess: Start training
49
+
50
+ state "Training Process" as TrainingProcess {
51
+ [*] --> InitializeTrackio: Setup experiment tracking
52
+ InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
53
+ RunTrainingScript --> StreamLogs: Show real-time training logs
54
+ StreamLogs --> MonitorProgress: Track metrics & checkpoints
55
+
56
+ MonitorProgress --> TrainingComplete: Training finished
57
+ MonitorProgress --> HandleErrors: Training failed
58
+ HandleErrors --> RetryOrExit: User can retry or exit
59
+ }
60
+
61
+ TrainingProcess --> PostTraining: Training complete
62
+
63
+ state "Post-Training Actions" as PostTraining {
64
+ [*] --> PushToHub: Push model to HF Hub
65
+ [*] --> GenerateModelCard: Create model card
66
+ [*] --> DeployDemoSpace: Deploy interactive demo
67
+
68
+ PushToHub --> ModelPublished: Model available on HF Hub
69
+ GenerateModelCard --> ModelDocumented: Model card created
70
+ DeployDemoSpace --> DemoReady: Demo space deployed
71
+ }
72
+
73
+ PostTraining --> [*]: Process complete
74
+
75
+ %% Alternative paths
76
+ DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
77
+ PushDatasetOnly --> DatasetPublished: Dataset on HF Hub
78
+
79
+ %% Error handling
80
+ TrainingProcess --> ErrorRecovery: Handle training errors
81
+ ErrorRecovery --> RetryTraining: Retry with different settings
82
+ RetryTraining --> TrainingConfiguration
83
+
84
+ %% Styling and notes
85
+ note right of LanguageSelection : User selects language for<br/>authentic phrases from<br/>NVIDIA Granary dataset
86
+ note right of RecordingInterface : Users record themselves<br/>reading displayed phrases
87
+ note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
88
+ note right of TrainingConfiguration : Configure LoRA parameters,<br/>learning rate, epochs, etc.
89
+ note right of TrainingProcess : Real-time log streaming<br/>with Trackio integration
90
+ note right of PostTraining : Automated deployment<br/>pipeline
91
+ ```
92
+
93
+ ## Interface Workflow Overview
94
+
95
+ This diagram illustrates the complete user journey through the Voxtral ASR Fine-tuning interface. The workflow is designed to be intuitive and guide users through each step of the fine-tuning process.
96
+
97
+ ### Key Workflow Stages
98
+
99
+ #### 1. Language & Dataset Setup
100
+ - **Language Selection**: Users choose from 25+ European languages supported by NVIDIA Granary
101
+ - **Phrase Loading**: System loads authentic, high-quality phrases in the selected language
102
+ - **Recording Interface**: Dynamic interface showing phrases with audio recording components
103
+ - **Progressive Disclosure**: Users can add more rows as needed (up to 100 recordings)
104
+
105
+ #### 2. Dataset Creation
106
+ - **From Recordings**: Process microphone recordings into WAV files and JSONL dataset
107
+ - **From Uploads**: Handle existing WAV/FLAC files with manual transcripts
108
+ - **JSONL Format**: Standard format with `audio_path` and `text` fields; an example of reading it back follows this list
109
+ - **Local Storage**: Datasets stored in `datasets/voxtral_user/` directory
110
+
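+ Each dataset row is one JSON object per line. A minimal sketch of reading the file back (the path follows the local storage layout described above):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ # One {"audio_path": ..., "text": ...} object per line
+ jsonl = Path("datasets/voxtral_user/data.jsonl")
+ rows = [json.loads(line) for line in jsonl.read_text(encoding="utf-8").splitlines() if line.strip()]
+ print(rows[0]["audio_path"], rows[0]["text"])
+ ```
+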
111
+ #### 3. Training Configuration
112
+ - **Basic Settings**: Model selection, LoRA vs full fine-tuning, batch size
113
+ - **Advanced Settings**: Learning rate, epochs, gradient accumulation
114
+ - **LoRA Parameters**: r, alpha, dropout, audio tower freezing options
115
+ - **Repository Setup**: Model naming and Hugging Face Hub integration
116
+
117
+ #### 4. Training Process
118
+ - **Trackio Integration**: Automatic experiment tracking setup
119
+ - **Script Execution**: Calls the appropriate training script (`train.py` or `train_lora.py`); a subprocess sketch follows this list
120
+ - **Log Streaming**: Real-time display of training progress and metrics
121
+ - **Error Handling**: Graceful handling of training failures with retry options
122
+
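+ Under the hood, the interface launches the selected script as a subprocess and relays its stdout to the UI. A minimal sketch of that pattern (command-line flags are omitted here; the real invocation is assembled by `interface.py`):
+
+ ```python
+ import subprocess
+ import sys
+
+ # Run the training script and stream its logs line by line
+ proc = subprocess.Popen(
+     [sys.executable, "scripts/train_lora.py"],  # real runs append the configured flags
+     stdout=subprocess.PIPE,
+     stderr=subprocess.STDOUT,
+     text=True,
+ )
+ for line in proc.stdout:
+     print(line, end="")  # the Gradio app appends each line to the log view
+ proc.wait()
+ ```
+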
123
+ #### 5. Post-Training Actions
124
+ - **Model Publishing**: Automatic push to Hugging Face Hub
125
+ - **Model Card Generation**: Automated creation using `generate_model_card.py`
126
+ - **Demo Deployment**: One-click deployment of interactive demo spaces
127
+
128
+ ### Alternative Paths
129
+
130
+ #### Dataset-Only Workflow
131
+ - Users can create and publish datasets without training models
132
+ - Useful for dataset curation and sharing
133
+
134
+ #### Error Recovery
135
+ - Training failures trigger error recovery flows
136
+ - Users can retry with modified parameters
137
+ - Comprehensive error logging and debugging information
138
+
139
+ ### Technical Integration Points
140
+
141
+ #### External Services
142
+ - **NVIDIA Granary**: Source of high-quality multilingual ASR data
143
+ - **Hugging Face Hub**: Model and dataset storage and sharing
144
+ - **Trackio Spaces**: Experiment tracking and visualization
145
+
146
+ #### Script Integration
147
+ - **interface.py**: Main Gradio application orchestrating the workflow
148
+ - **train.py/train_lora.py**: Core training scripts with Trackio integration
149
+ - **push_to_huggingface.py**: Model/dataset publishing
150
+ - **deploy_demo_space.py**: Automated demo deployment
151
+ - **generate_model_card.py**: Model documentation generation
152
+
153
+ ### User Experience Features
154
+
155
+ #### Progressive Interface Reveal
156
+ - Interface components are revealed as users progress through the workflow
157
+ - Reduces cognitive load and guides users step-by-step
158
+
159
+ #### Real-time Feedback
160
+ - Live log streaming during training
161
+ - Progress indicators and status updates
162
+ - Immediate feedback on dataset creation and validation
163
+
164
+ #### Flexible Input Methods
165
+ - Support for both live recording and file uploads
166
+ - Multiple language options for diverse user needs
167
+ - Scalable recording interface (10-100 samples)
168
+
169
+ See also:
170
+ - [Architecture Overview](architecture.md)
171
+ - [Training Pipeline](training-pipeline.md)
172
+ - [Data Flow](data-flow.md)
173
+
docs/training-pipeline.md ADDED
@@ -0,0 +1,271 @@
1
+ # Training Pipeline
2
+
3
+ ```mermaid
4
+ graph TB
5
+ %% Input Data Sources
6
+ subgraph "Data Sources"
7
+ JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
8
+ GRANARY[NVIDIA Granary Dataset<br/>Multilingual ASR Data]
9
+ HFDATA[HF Hub Datasets<br/>Community Datasets]
10
+ end
11
+
12
+ %% Data Processing
13
+ subgraph "Data Processing"
14
+ LOADER["Dataset Loader<br/>_load_jsonl_dataset()"]
15
+ CASTER[Audio Casting<br/>16kHz resampling]
16
+ COLLATOR[VoxtralDataCollator<br/>Audio + Text Processing]
17
+ end
18
+
19
+ %% Training Scripts
20
+ subgraph "Training Scripts"
21
+ TRAIN_FULL[Full Fine-tuning<br/>scripts/train.py]
22
+ TRAIN_LORA[LoRA Fine-tuning<br/>scripts/train_lora.py]
23
+
24
+ subgraph "Training Components"
25
+ MODEL_INIT[Model Initialization<br/>VoxtralForConditionalGeneration]
26
+ LORA_CONFIG[LoRA Configuration<br/>LoraConfig + get_peft_model]
27
+ PROCESSOR_INIT[Processor Initialization<br/>VoxtralProcessor]
28
+ end
29
+ end
30
+
31
+ %% Training Infrastructure
32
+ subgraph "Training Infrastructure"
33
+ TRACKIO_INIT[Trackio Integration<br/>Experiment Tracking]
34
+ HF_TRAINER[Hugging Face Trainer<br/>TrainingArguments + Trainer]
35
+ TORCH_DEVICE[Torch Device Setup<br/>GPU/CPU Detection]
36
+ end
37
+
38
+ %% Training Process
39
+ subgraph "Training Process"
40
+ FORWARD_PASS[Forward Pass<br/>Audio Processing + Generation]
41
+ LOSS_CALC[Loss Calculation<br/>Masked Language Modeling]
42
+ BACKWARD_PASS[Backward Pass<br/>Gradient Computation]
43
+ OPTIMIZER_STEP[Optimizer Step<br/>Parameter Updates]
44
+ LOGGING[Metrics Logging<br/>Loss, Perplexity, etc.]
45
+ end
46
+
47
+ %% Model Management
48
+ subgraph "Model Management"
49
+ CHECKPOINT_SAVING[Checkpoint Saving<br/>Model snapshots]
50
+ MODEL_SAVING[Final Model Saving<br/>Processor + Model]
51
+ LOCAL_STORAGE[Local Storage<br/>outputs/ directory]
52
+ end
53
+
54
+ %% Flow Connections
55
+ JSONL --> LOADER
56
+ GRANARY --> LOADER
57
+ HFDATA --> LOADER
58
+
59
+ LOADER --> CASTER
60
+ CASTER --> COLLATOR
61
+
62
+ COLLATOR --> TRAIN_FULL
63
+ COLLATOR --> TRAIN_LORA
64
+
65
+ TRAIN_FULL --> MODEL_INIT
66
+ TRAIN_LORA --> MODEL_INIT
67
+ TRAIN_LORA --> LORA_CONFIG
68
+
69
+ MODEL_INIT --> PROCESSOR_INIT
70
+ LORA_CONFIG --> PROCESSOR_INIT
71
+
72
+ PROCESSOR_INIT --> TRACKIO_INIT
73
+ PROCESSOR_INIT --> HF_TRAINER
74
+ PROCESSOR_INIT --> TORCH_DEVICE
75
+
76
+ TRACKIO_INIT --> HF_TRAINER
77
+ TORCH_DEVICE --> HF_TRAINER
78
+
79
+ HF_TRAINER --> FORWARD_PASS
80
+ FORWARD_PASS --> LOSS_CALC
81
+ LOSS_CALC --> BACKWARD_PASS
82
+ BACKWARD_PASS --> OPTIMIZER_STEP
83
+ OPTIMIZER_STEP --> LOGGING
84
+
85
+ LOGGING --> CHECKPOINT_SAVING
86
+ LOGGING --> TRACKIO_INIT
87
+
88
+ HF_TRAINER --> MODEL_SAVING
89
+ MODEL_SAVING --> LOCAL_STORAGE
90
+
91
+ %% Styling
92
+ classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
93
+ classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
94
+ classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
95
+ classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
96
+ classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
97
+ classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px
98
+
99
+ class JSONL,GRANARY,HFDATA input
100
+ class LOADER,CASTER,COLLATOR processing
101
+ class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
102
+ class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
103
+ class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
104
+ class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
105
+ ```
106
+
107
+ ## Training Pipeline Overview
108
+
109
+ This diagram illustrates the complete training pipeline for Voxtral ASR fine-tuning, showing how data flows through the training scripts and supporting infrastructure.
110
+
111
+ ### Data Input Sources
112
+
113
+ #### JSONL Datasets
114
+ - **Local Datasets**: User-created datasets from recordings or uploads
115
+ - **Format**: `{"audio_path": "path/to/audio.wav", "text": "transcription"}`
116
+ - **Processing**: Loaded via the `_load_jsonl_dataset()` helper; a sketch of such a loader follows this list
117
+
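+ A minimal sketch of what such a loader can look like (the real `_load_jsonl_dataset()` lives in the training scripts and may differ in detail):
+
+ ```python
+ import json
+
+ from datasets import Audio, Dataset
+
+ def _load_jsonl_dataset(jsonl_path: str) -> Dataset:
+     """Read {'audio_path', 'text'} rows and expose a decodable 'audio' column."""
+     with open(jsonl_path, "r", encoding="utf-8") as f:
+         rows = [json.loads(line) for line in f if line.strip()]
+     ds = Dataset.from_list([{"audio": r["audio_path"], "text": r["text"]} for r in rows])
+     # Decoding and 16kHz resampling happen lazily on access
+     return ds.cast_column("audio", Audio(sampling_rate=16000))
+ ```
+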
118
+ #### NVIDIA Granary Dataset
119
+ - **Multilingual Support**: 25+ European languages
120
+ - **High Quality**: Curated ASR training data
121
+ - **Streaming**: Efficient loading without a full download (streaming sketch after this list)
122
+
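+ Streaming keeps only the samples you iterate over in memory; a sketch with the `datasets` streaming API (the dataset id and config name here are assumptions, not verified identifiers):
+
+ ```python
+ from datasets import load_dataset
+
+ # streaming=True avoids downloading the full corpus up front
+ granary = load_dataset("nvidia/Granary", "de", split="train", streaming=True)  # id/config assumed
+ first = next(iter(granary))
+ print(first.keys())
+ ```
+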
123
+ #### Hugging Face Hub Datasets
124
+ - **Community Datasets**: Public datasets from HF Hub
125
+ - **Standard Formats**: Compatible with Voxtral training requirements
126
+
127
+ ### Data Processing Pipeline
128
+
129
+ #### Dataset Loading
130
+ ```python
131
+ from datasets import load_dataset
+
+ # Load a local JSONL dataset (repo helper) or a dataset from the HF Hub
132
+ ds = _load_jsonl_dataset(jsonl_path)
133
+ # or
134
+ ds = load_dataset(ds_name, ds_cfg, split="test")
135
+ ```
136
+
137
+ #### Audio Processing
138
+ ```python
139
+ from datasets import Audio
+
+ # Cast to Audio format with 16kHz resampling
140
+ ds = ds.cast_column("audio", Audio(sampling_rate=16000))
141
+ ```
142
+
143
+ #### Data Collation
144
+ - **VoxtralDataCollator**: Custom collator for Voxtral training (sketched after this list)
145
+ - **Audio Processing**: Converts audio to model inputs
146
+ - **Text Tokenization**: Processes transcription text
147
+ - **Masking**: Masks prompt tokens during training
148
+
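+ A simplified sketch of the collation idea: batch audio with text, then mask everything except the transcription out of the loss. The processor call below is a stand-in, not the exact `VoxtralProcessor` signature:
+
+ ```python
+ class CollatorSketch:
+     """Illustrative only; see VoxtralDataCollator in the training scripts."""
+
+     def __init__(self, processor, prompt_len):
+         self.processor = processor
+         self.prompt_len = prompt_len  # leading audio/prompt tokens excluded from the loss
+
+     def __call__(self, features):
+         batch = self.processor(  # assumed call; the real API may differ
+             text=[f["text"] for f in features],
+             audio=[f["audio"]["array"] for f in features],
+             sampling_rate=16000,
+             padding=True,
+             return_tensors="pt",
+         )
+         labels = batch["input_ids"].clone()
+         labels[:, : self.prompt_len] = -100  # -100 is ignored by the LM loss
+         labels[labels == self.processor.tokenizer.pad_token_id] = -100
+         batch["labels"] = labels
+         return batch
+ ```
+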
149
+ ### Training Script Architecture
150
+
151
+ #### Full Fine-tuning (`train.py`)
152
+ - **Complete Model Updates**: All parameters trainable
153
+ - **Higher Memory Requirements**: Full model in memory
154
+ - **Better Convergence**: Can achieve higher accuracy
155
+
156
+ #### LoRA Fine-tuning (`train_lora.py`)
157
+ - **Parameter Efficient**: Only LoRA adapters trained
158
+ - **Lower Memory Usage**: Base model frozen
159
+ - **Faster Training**: Fewer parameters to update
160
+ - **Configurable**: r, alpha, dropout parameters (see the sketch below)
161
+
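+ These knobs map directly onto PEFT's `LoraConfig`; a minimal sketch (the values and `target_modules` are illustrative, the authoritative choices live in `scripts/train_lora.py`):
+
+ ```python
+ from peft import LoraConfig, get_peft_model
+
+ lora_cfg = LoraConfig(
+     r=16,                                # adapter rank
+     lora_alpha=32,                       # scaling factor
+     lora_dropout=0.05,
+     target_modules=["q_proj", "v_proj"], # illustrative attention projections
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora_cfg)  # base weights stay frozen
+ model.print_trainable_parameters()
+ ```
+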
162
+ ### Training Infrastructure
163
+
164
+ #### Trackio Integration
165
+ ```python
166
+ import trackio
+
+ trackio.init(
167
+ project="voxtral-finetuning",
168
+ config={...}, # Training parameters
169
+ space_id=trackio_space
170
+ )
171
+ ```
172
+
173
+ #### Hugging Face Trainer
174
+ ```python
175
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
176
+ output_dir=output_dir,
177
+ per_device_train_batch_size=batch_size,
178
+ learning_rate=learning_rate,
179
+ num_train_epochs=epochs,
180
+ bf16=True, # BFloat16 for efficiency
181
+ report_to=["trackio"],
182
+ # ... other args
183
+ )
184
+ ```
185
+
186
+ #### Device Management
187
+ - **GPU Detection**: Automatic CUDA/GPU detection (minimal version below)
188
+ - **Fallback**: CPU training if no GPU available
189
+ - **Memory Optimization**: Model sharding and gradient checkpointing
190
+
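+ The detection itself is short:
+
+ ```python
+ import torch
+
+ # Prefer CUDA when available, otherwise fall back to CPU
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ print(f"Training on {device}")
+ ```
+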
191
+ ### Training Process Flow
192
+
193
+ #### Forward Pass
194
+ 1. **Audio Input**: Raw audio waveforms
195
+ 2. **Audio Tower**: Audio feature extraction
196
+ 3. **Text Generation**: Autoregressive text generation from audio features
197
+
198
+ #### Loss Calculation
199
+ - **Masked Language Modeling**: Only transcription tokens contribute to loss
200
+ - **Audio Prompt Masking**: Audio processing tokens are masked out
201
+ - **Cross-Entropy Loss**: Standard language modeling loss
202
+
203
+ #### Backward Pass & Optimization
204
+ - **Gradient Computation**: Backpropagation through the model
205
+ - **LoRA Updates**: Only adapter parameters updated (LoRA mode)
206
+ - **Full Updates**: All parameters updated (full fine-tuning)
207
+
208
+ ### Model Management
209
+
210
+ #### Checkpoint Saving
211
+ - **Regular Checkpoints**: Saved every N steps
212
+ - **Best Model Tracking**: Save best model based on validation loss
213
+ - **Resume Capability**: Continue training from checkpoints (see below)
214
+
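+ Resuming reuses the standard `Trainer` mechanism:
+
+ ```python
+ # Pick up from the latest checkpoint in output_dir
+ trainer.train(resume_from_checkpoint=True)
+
+ # ...or from a specific snapshot
+ trainer.train(resume_from_checkpoint="outputs/voxtral-finetuned/checkpoint-500")
+ ```
+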
215
+ #### Final Model Saving
216
+ ```python
217
+ trainer.save_model() # Saves model and tokenizer
218
+ processor.save_pretrained(output_dir) # Saves processor
219
+ ```
220
+
221
+ #### Local Storage Structure
222
+ ```
223
+ outputs/
224
+ ├── voxtral-finetuned-{timestamp}/
225
+ │   ├── config.json
226
+ │   ├── model.safetensors
227
+ │   ├── tokenizer.json
228
+ │   ├── training_config.json
229
+ │   ├── train_results.json
230
+ │   └── eval_results.json
231
+ ```
232
+
233
+ ### Integration Points
234
+
235
+ #### With Interface (`interface.py`)
236
+ - **Parameter Passing**: Training parameters from UI
237
+ - **Log Streaming**: Real-time training logs to UI
238
+ - **Progress Monitoring**: Training progress updates
239
+
240
+ #### With Model Publishing (`push_to_huggingface.py`)
241
+ - **Model Upload**: Trained model to HF Hub
242
+ - **Metadata**: Training config and results
243
+ - **Model Cards**: Automatic model card generation
244
+
245
+ #### With Demo Deployment (`deploy_demo_space.py`)
246
+ - **Space Creation**: HF Spaces for demos
247
+ - **Model Integration**: Deploy trained model in demo
248
+ - **Configuration**: Demo-specific settings
249
+
250
+ ### Performance Considerations
251
+
252
+ #### Memory Optimization
253
+ - **LoRA**: Significantly reduces memory requirements
254
+ - **Gradient Checkpointing**: Trade compute for memory (flags shown below)
255
+ - **Mixed Precision**: BF16/FP16 training
256
+
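+ Both options are ordinary `TrainingArguments` flags:
+
+ ```python
+ from transformers import TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="outputs/voxtral-finetuned",
+     gradient_checkpointing=True,  # recompute activations instead of storing them
+     bf16=True,                    # mixed precision on supported hardware
+ )
+ ```
+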
257
+ #### Training Efficiency
258
+ - **Batch Size**: Balanced with gradient accumulation
259
+ - **Learning Rate**: Warmup and decay schedules
260
+ - **Early Stopping**: Prevent overfitting
261
+
262
+ #### Monitoring & Debugging
263
+ - **Metrics Tracking**: Loss, perplexity, learning rate
264
+ - **GPU Utilization**: Memory and compute monitoring
265
+ - **Error Handling**: Graceful failure recovery
266
+
267
+ See also:
268
+ - [Architecture Overview](architecture.md)
269
+ - [Interface Workflow](interface-workflow.md)
270
+ - [Data Flow](data-flow.md)
271
+
scripts/generate_svgs.py ADDED
@@ -0,0 +1,135 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Generate SVG versions of Mermaid diagrams for documentation
4
+ """
5
+
6
+ import os
7
+ import re
8
+ import requests
9
+ import json
10
+ from pathlib import Path
11
+ from typing import Optional
12
+
13
+ class MermaidToSVGConverter:
14
+ """Convert Mermaid diagrams to SVG format"""
15
+
16
+ def __init__(self):
17
+ self.mermaid_api_url = "https://mermaid.ink/svg/"  # /svg/ returns SVG; /img/ returns a raster image
18
+
19
+ def extract_mermaid_code(self, markdown_file: Path) -> Optional[str]:
20
+ """Extract Mermaid code from a Markdown file"""
21
+ try:
22
+ with open(markdown_file, 'r', encoding='utf-8') as f:
23
+ content = f.read()
24
+
25
+ # Find Mermaid code blocks
26
+ mermaid_pattern = r'```mermaid\s*\n(.*?)\n```'
27
+ match = re.search(mermaid_pattern, content, re.DOTALL)
28
+
29
+ if match:
30
+ return match.group(1).strip()
31
+ else:
32
+ print(f"No Mermaid diagram found in {markdown_file}")
33
+ return None
34
+
35
+ except Exception as e:
36
+ print(f"Error reading {markdown_file}: {e}")
37
+ return None
38
+
39
+ def convert_to_svg(self, mermaid_code: str, output_path: Path) -> bool:
40
+ """Convert Mermaid code to SVG using mermaid.ink service"""
41
+ try:
42
+ # Encode the Mermaid code for the URL
43
+ import base64
44
+ import urllib.parse
45
+
46
+ # Create the data URL format expected by mermaid.ink
47
+ mermaid_data = f"%%{{init: {{'theme': 'base', 'themeVariables': {{'primaryColor': '#e3f2fd', 'primaryTextColor': '#1976d2', 'primaryBorderColor': '#01579b', 'lineColor': '#424242', 'secondaryColor': '#fff3e0', 'tertiaryColor': '#fce4ec'}}}}}}%%\n{mermaid_code}"
48
+
49
+ # Base64 encode the mermaid code
50
+ encoded = base64.b64encode(mermaid_data.encode('utf-8')).decode('utf-8')
51
+ url_encoded = urllib.parse.quote(encoded)
52
+
53
+ # Create the full URL
54
+ full_url = f"{self.mermaid_api_url}{url_encoded}"
55
+
56
+ # Make the request
57
+ response = requests.get(full_url, timeout=30)
58
+
59
+ if response.status_code == 200:
60
+ # Save the SVG
61
+ with open(output_path, 'wb') as f:
62
+ f.write(response.content)
63
+ print(f"✅ Generated SVG: {output_path}")
64
+ return True
65
+ else:
66
+ print(f"❌ Failed to generate SVG for {output_path}: HTTP {response.status_code}")
67
+ return False
68
+
69
+ except Exception as e:
70
+ print(f"❌ Error generating SVG for {output_path}: {e}")
71
+ return False
72
+
73
+ def process_markdown_file(self, markdown_file: Path, output_dir: Path) -> bool:
74
+ """Process a single Markdown file and generate its SVG"""
75
+ # Extract Mermaid code
76
+ mermaid_code = self.extract_mermaid_code(markdown_file)
77
+ if not mermaid_code:
78
+ return False
79
+
80
+ # Create output filename
81
+ svg_filename = markdown_file.stem + ".svg"
82
+ output_path = output_dir / svg_filename
83
+
84
+ # Convert to SVG
85
+ return self.convert_to_svg(mermaid_code, output_path)
86
+
87
+ def main():
88
+ """Main function to generate SVGs for all documentation files"""
89
+ print("🔄 Generating SVG versions of documentation diagrams...")
90
+
91
+ # Setup paths
92
+ docs_dir = Path(__file__).parent.parent / "docs"
93
+ svgs_dir = docs_dir / "svgs"
94
+
95
+ # Create SVGs directory
96
+ svgs_dir.mkdir(exist_ok=True)
97
+
98
+ # Initialize converter
99
+ converter = MermaidToSVGConverter()
100
+
101
+ # Process all Markdown files in docs directory
102
+ markdown_files = [
103
+ "README.md",
104
+ "architecture.md",
105
+ "interface-workflow.md",
106
+ "training-pipeline.md",
107
+ "deployment-pipeline.md",
108
+ "data-flow.md"
109
+ ]
110
+
111
+ success_count = 0
112
+ total_count = len(markdown_files)
113
+
114
+ for filename in markdown_files:
115
+ markdown_path = docs_dir / filename
116
+ if markdown_path.exists():
117
+ print(f"\n📄 Processing {filename}...")
118
+ if converter.process_markdown_file(markdown_path, svgs_dir):
119
+ success_count += 1
120
+ else:
121
+ print(f"⚠️ File not found: {markdown_path}")
122
+
123
+ print("\n🎉 SVG generation complete!")
124
+ print(f"✅ Successfully generated: {success_count}/{total_count} SVGs")
125
+ print(f"📁 SVGs saved to: {svgs_dir}")
126
+
127
+ if success_count < total_count:
128
+ print(f"❌ Failed to generate: {total_count - success_count} SVGs")
129
+ return 1
130
+
131
+ return 0
132
+
133
+ if __name__ == "__main__":
134
+ exit(main())
135
+
scripts/validate_mermaid.py ADDED
@@ -0,0 +1,73 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Validate Mermaid syntax in HTML documentation
4
+ """
5
+
6
+ import re
7
+
8
+ def validate_mermaid_html(html_file):
9
+ """Validate Mermaid diagrams in HTML file"""
10
+ print(f"🔍 Validating Mermaid syntax in {html_file}")
11
+
12
+ with open(html_file, 'r', encoding='utf-8') as f:
13
+ content = f.read()
14
+
15
+ # Find all Mermaid blocks
16
+ mermaid_pattern = r'<div class="mermaid">(.*?)</div>'
17
+ mermaid_blocks = re.findall(mermaid_pattern, content, re.DOTALL)
18
+
19
+ print(f"📊 Found {len(mermaid_blocks)} Mermaid blocks")
20
+
21
+ issues = []
22
+
23
+ # Check each Mermaid block
24
+ for i, block in enumerate(mermaid_blocks):
25
+ lines = block.strip().split('\n')
26
+ if not lines or not lines[0].strip():
27
+ issues.append(f"Block {i+1}: Empty Mermaid block")
28
+ continue
29
+
30
+ first_line = lines[0].strip()
31
+
32
+ # Check if it starts with a valid diagram type
33
+ valid_starts = [
34
+ 'graph', 'flowchart', 'stateDiagram', 'sequenceDiagram',
35
+ 'classDiagram', 'erDiagram', 'journey', 'gantt', 'pie',
36
+ 'gitGraph', 'mindmap', 'timeline', 'sankey'
37
+ ]
38
+
39
+ if not any(first_line.startswith(start) for start in valid_starts):
40
+ issues.append(f"Block {i+1}: Invalid diagram type start - '{first_line}'")
41
+
42
+ # Check for classDef/class consistency
43
+ if 'classDef' in block:
44
+ class_statements = len(re.findall(r'^\s*class\s+', block, re.MULTILINE))
45
+ if class_statements == 0:
46
+ issues.append(f"Block {i+1}: classDef defined but no class statements found")
47
+
48
+ # Check for basic syntax issues
49
+ if block.count('[') != block.count(']'):
50
+ issues.append(f"Block {i+1}: Unmatched square brackets")
51
+
52
+ if block.count('(') != block.count(')'):
53
+ issues.append(f"Block {i+1}: Unmatched parentheses")
54
+
55
+ if 'subgraph' in block:
56
+ subgraph_count = block.count('subgraph')
57
+ end_count = block.count('end')
58
+ if subgraph_count != end_count:
59
+ issues.append(f"Block {i+1}: Unmatched subgraph/end blocks ({subgraph_count} vs {end_count})")
60
+
61
+ # Report results
62
+ print("\n🔍 Validation Results:")
63
+ if issues:
64
+ print("❌ Issues found:")
65
+ for issue in issues:
66
+ print(f" - {issue}")
67
+ return False
68
+ else:
69
+ print("✅ No syntax issues found!")
70
+ return True
71
+
72
+ if __name__ == "__main__":
73
+ validate_mermaid_html("docs/diagrams.html")