Joseph Pollack committed
adds docs
- docs/README.md +246 -0
- docs/architecture.md +126 -0
- docs/data-flow.md +374 -0
- docs/deployment-pipeline.md +323 -0
- docs/diagrams.html +728 -0
- docs/interface-workflow.md +173 -0
- docs/training-pipeline.md +271 -0
- scripts/generate_svgs.py +135 -0
- scripts/validate_mermaid.py +73 -0
docs/README.md
ADDED
@@ -0,0 +1,246 @@
# Voxtral ASR Fine-tuning Documentation

```mermaid
graph TD
    %% Main Entry Point
    START([Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}

    %% Documentation Categories
    OVERVIEW --> ARCH[Architecture Overview]
    OVERVIEW --> WORKFLOW[Interface Workflow]
    OVERVIEW --> TRAINING[Training Pipeline]
    OVERVIEW --> DEPLOYMENT[Deployment Pipeline]
    OVERVIEW --> DATAFLOW[Data Flow]

    %% Architecture Section
    ARCH --> ARCH_DIAG[High-level Architecture<br/>System Components & Layers]
    ARCH --> ARCH_LINK["View Details → architecture.md"]

    %% Interface Section
    WORKFLOW --> WORKFLOW_DIAG[User Journey<br/>Recording → Training → Demo]
    WORKFLOW --> WORKFLOW_LINK["View Details → interface-workflow.md"]

    %% Training Section
    TRAINING --> TRAINING_DIAG[Training Scripts<br/>Data → Model → Results]
    TRAINING --> TRAINING_LINK["View Details → training-pipeline.md"]

    %% Deployment Section
    DEPLOYMENT --> DEPLOYMENT_DIAG[Publishing & Demo<br/>Model → Hub → Space]
    DEPLOYMENT --> DEPLOYMENT_LINK["View Details → deployment-pipeline.md"]

    %% Data Flow Section
    DATAFLOW --> DATAFLOW_DIAG[Complete Data Journey<br/>Input → Processing → Output]
    DATAFLOW --> DATAFLOW_LINK["View Details → data-flow.md"]

    %% Key Components Highlight
    subgraph "Core Components"
        INTERFACE[interface.py<br/>Gradio Web UI]
        TRAIN_SCRIPTS[scripts/train*.py<br/>Training Scripts]
        DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
        PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
    end

    %% Data Flow Highlight
    subgraph "Key Data Formats"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        HFDATA[HF Hub Models<br/>username/model-name]
        SPACES[HF Spaces<br/>Interactive Demos]
    end

    %% Connect components to their respective docs
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT

    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    %% Styling
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
```

## Voxtral ASR Fine-tuning Application

This documentation provides comprehensive diagrams and explanations of the Voxtral ASR Fine-tuning application architecture and workflows.

### What is Voxtral ASR Fine-tuning?

Voxtral is a powerful Automatic Speech Recognition (ASR) model that can be fine-tuned for specific tasks and languages. This application provides:

- **Easy Data Collection**: Record audio or upload files with transcripts
- **One-Click Training**: Fine-tune Voxtral with LoRA or full parameter updates
- **Instant Deployment**: Deploy interactive demos to Hugging Face Spaces
- **Experiment Tracking**: Monitor training progress with Trackio integration

### Documentation Overview

#### [Architecture Overview](architecture.md)
High-level view of system components and their relationships:
- **User Interface Layer**: Gradio web interface
- **Data Processing Layer**: Audio processing and dataset creation
- **Training Layer**: Full and LoRA fine-tuning scripts
- **Model Management Layer**: HF Hub integration and model cards
- **Deployment Layer**: Demo space deployment

#### [Interface Workflow](interface-workflow.md)
Complete user journey through the application:
- **Language Selection**: Choose from 25+ languages via NVIDIA Granary
- **Data Collection**: Record audio or upload existing files
- **Dataset Creation**: Process audio and transcripts into JSONL format
- **Training Configuration**: Set hyperparameters and options
- **Live Training**: Real-time progress monitoring
- **Auto Deployment**: One-click model publishing and demo creation

#### [Training Pipeline](training-pipeline.md)
Detailed training process and script interactions:
- **Data Sources**: JSONL datasets, HF Hub datasets, NVIDIA Granary
- **Data Processing**: Audio resampling, text tokenization, data collation
- **Training Scripts**: `train.py` (full) vs. `train_lora.py` (parameter-efficient)
- **Infrastructure**: Trackio logging, Hugging Face Trainer, device management
- **Model Outputs**: Trained models, training logs, checkpoints

#### [Deployment Pipeline](deployment-pipeline.md)
Model publishing and demo deployment process:
- **Model Publishing**: Push to Hugging Face Hub with metadata
- **Model Card Generation**: Automated documentation creation
- **Demo Space Deployment**: Create interactive demos on HF Spaces
- **Configuration Management**: Environment variables and secrets
- **Live Demo Features**: Real-time ASR inference interface

#### [Data Flow](data-flow.md)
Complete data journey through the system:
- **Input Sources**: Microphone recordings, file uploads, external datasets
- **Processing Pipeline**: Audio resampling, text cleaning, JSONL conversion
- **Training Flow**: Dataset loading, batching, model training
- **Output Pipeline**: Model files, logs, checkpoints, published assets
- **External Integration**: HF Hub, NVIDIA Granary, Trackio Spaces

### Core Components

| Component | Purpose | Key Features |
|-----------|---------|--------------|
| `interface.py` | Main web application | Gradio UI, data collection, training orchestration |
| `scripts/train.py` | Full model fine-tuning | Complete parameter updates, maximum accuracy |
| `scripts/train_lora.py` | LoRA fine-tuning | Parameter-efficient, faster training, lower memory |
| `scripts/deploy_demo_space.py` | Demo deployment | Automated HF Spaces creation and configuration |
| `scripts/push_to_huggingface.py` | Model publishing | HF Hub integration, model card generation |
| `scripts/generate_model_card.py` | Documentation | Automated model card creation from templates |

### Key Data Formats

#### JSONL Dataset Format
```json
{"audio_path": "path/to/audio.wav", "text": "transcription text"}
```
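
A minimal sketch of producing and consuming this format, assuming only the two fields shown above (file and path names here are illustrative):

```python
import json

# Append one record per line; each pairs an audio file with its transcript.
records = [
    {"audio_path": "wavs/recording_0000.wav", "text": "hello world"},
    {"audio_path": "wavs/recording_0001.wav", "text": "testing voxtral fine-tuning"},
]
with open("data.jsonl", "a", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read it back: one JSON object per non-empty line.
with open("data.jsonl", encoding="utf-8") as f:
    dataset = [json.loads(line) for line in f if line.strip()]
```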

#### Training Configuration
```json
{
  "model_checkpoint": "mistralai/Voxtral-Mini-3B-2507",
  "batch_size": 2,
  "learning_rate": 5e-5,
  "epochs": 3,
  "lora_r": 8,
  "lora_alpha": 32
}
```
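
As a rough illustration of consuming such a config (the field names follow the example above; this loader is a sketch, not the app's actual code):

```python
import json
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    model_checkpoint: str = "mistralai/Voxtral-Mini-3B-2507"
    batch_size: int = 2
    learning_rate: float = 5e-5
    epochs: int = 3
    lora_r: int = 8
    lora_alpha: int = 32

with open("training_config.json") as f:
    cfg = TrainingConfig(**json.load(f))
print(cfg.model_checkpoint, cfg.learning_rate)
```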

#### Model Repository Structure
```
username/model-name/
├── model.safetensors
├── config.json
├── tokenizer.json
├── README.md (model card)
└── training_results/
```
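
Such a repository can be pulled locally with the standard `huggingface_hub` client; the repo id below is a placeholder:

```python
from huggingface_hub import snapshot_download

# Downloads the full repository into the local HF cache and returns its path.
local_dir = snapshot_download(repo_id="username/model-name")
print(local_dir)  # contains model.safetensors, config.json, tokenizer.json, ...
```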

### Quick Start

1. **Set Environment Variables**:
   ```bash
   export HF_TOKEN=your_huggingface_token
   export HF_USERNAME=your_username
   ```

2. **Launch the Interface**:
   ```bash
   python interface.py
   ```

3. **Follow the Workflow**:
   - Select language → Record/upload data → Configure training → Start training
   - Monitor progress → View results → Deploy demo

### Prerequisites

- **Hardware**: NVIDIA GPU recommended for training
- **Software**: Python 3.8+, CUDA-compatible GPU drivers
- **Tokens**: Hugging Face token for model access and publishing
- **Storage**: Sufficient disk space for models and datasets

### Configuration Options

#### Training Modes
- **LoRA Fine-tuning**: Efficient, fast, lower memory usage (see the sketch below)
- **Full Fine-tuning**: Maximum accuracy, higher memory requirements
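
A minimal sketch of how the LoRA hyperparameters from the training configuration map onto a `peft` adapter config. The target modules here are a common choice for attention projections and are an assumption, not necessarily what `scripts/train_lora.py` uses:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                   # "lora_r" in the training configuration
    lora_alpha=32,                         # "lora_alpha"
    lora_dropout=0.05,                     # illustrative default
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # only adapter weights stay trainable
```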

#### Data Sources
- **User Recordings**: Live microphone input
- **File Uploads**: Existing WAV/FLAC files
- **NVIDIA Granary**: High-quality multilingual datasets
- **HF Hub Datasets**: Community-contributed datasets

#### Deployment Options
- **HF Hub Publishing**: Share models publicly
- **Demo Spaces**: Interactive web demos
- **Model Cards**: Automated documentation

### Performance & Metrics

#### Training Metrics
- **Loss Curves**: Training and validation loss
- **Perplexity**: Model confidence measure
- **Word Error Rate**: ASR accuracy (if available)
- **Training Time**: Time to convergence

#### Resource Usage
- **GPU Memory**: Peak memory usage during training
- **Training Time**: Hours to days, depending on dataset size
- **Model Size**: Disk space requirements

### Contributing

The documentation is organized as interlinked Markdown files with Mermaid diagrams. Each diagram focuses on a specific aspect:

- **architecture.md**: System overview and component relationships
- **interface-workflow.md**: User experience and interaction flow
- **training-pipeline.md**: Technical training process details
- **deployment-pipeline.md**: Publishing and deployment mechanics
- **data-flow.md**: Data movement and transformation

### Additional Resources

- **Hugging Face Spaces**: [Live Demo](https://huggingface.co/spaces)
- **Voxtral Models**: [Model Hub](https://huggingface.co/mistralai)
- **NVIDIA Granary**: [Dataset Documentation](https://huggingface.co/nvidia/Granary)
- **Trackio**: [Experiment Tracking](https://trackio.space)

---

*This documentation was automatically generated to explain the Voxtral ASR Fine-tuning application architecture and workflows.*
docs/architecture.md
ADDED
@@ -0,0 +1,126 @@
# Voxtral ASR Fine-tuning Architecture

```mermaid
graph TB
    %% User Interface Layer
    subgraph "User Interface"
        UI[Gradio Web Interface<br/>interface.py]
        REC[Audio Recording<br/>Microphone Input]
        UP[File Upload<br/>WAV/FLAC files]
    end

    %% Data Processing Layer
    subgraph "Data Processing"
        DP[Data Processing<br/>Audio resampling<br/>JSONL creation]
        DS[Dataset Management<br/>NVIDIA Granary<br/>Local datasets]
    end

    %% Training Layer
    subgraph "Training Pipeline"
        TF[Full Fine-tuning<br/>scripts/train.py]
        TL[LoRA Fine-tuning<br/>scripts/train_lora.py]
        TI[Trackio Integration<br/>Experiment Tracking]
    end

    %% Model Management Layer
    subgraph "Model Management"
        MM[Model Management<br/>Hugging Face Hub<br/>Local storage]
        MC[Model Card Generation<br/>scripts/generate_model_card.py]
    end

    %% Deployment Layer
    subgraph "Deployment & Demo"
        DEP[Demo Space Deployment<br/>scripts/deploy_demo_space.py]
        HF[HF Spaces<br/>Interactive Demo]
    end

    %% External Services
    subgraph "External Services"
        HFH[Hugging Face Hub<br/>Models & Datasets]
        GRAN[NVIDIA Granary<br/>Multilingual ASR Dataset]
        TRACK[Trackio Spaces<br/>Experiment Tracking]
    end

    %% Data Flow
    UI --> DP
    REC --> DP
    UP --> DP
    DP --> DS

    DS --> TF
    DS --> TL
    TF --> TI
    TL --> TI

    TF --> MM
    TL --> MM
    MM --> MC

    MM --> DEP
    DEP --> HF

    DS -.-> HFH
    MM -.-> HFH
    TI -.-> TRACK
    DS -.-> GRAN

    %% Styling
    classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class UI,REC,UP interface
    class DP,DS processing
    class TF,TL,TI training
    class MM,MC management
    class DEP,HF deployment
    class HFH,GRAN,TRACK external
```

## Architecture Overview

This diagram shows the high-level architecture of the Voxtral ASR Fine-tuning application. The system is organized into several layers:

### 1. User Interface Layer
- **Gradio Web Interface**: Main user-facing application built with Gradio
- **Audio Recording**: Microphone input for recording speech samples
- **File Upload**: Support for uploading existing WAV/FLAC audio files

### 2. Data Processing Layer
- **Data Processing**: Audio resampling to 16 kHz, JSONL dataset creation
- **Dataset Management**: Integration with the NVIDIA Granary dataset and local dataset handling

### 3. Training Layer
- **Full Fine-tuning**: Complete model fine-tuning using `scripts/train.py`
- **LoRA Fine-tuning**: Parameter-efficient fine-tuning using `scripts/train_lora.py`
- **Trackio Integration**: Experiment tracking and logging

### 4. Model Management Layer
- **Model Management**: Local storage and Hugging Face Hub integration
- **Model Card Generation**: Automated model card creation

### 5. Deployment Layer
- **Demo Space Deployment**: Automated deployment to Hugging Face Spaces
- **Interactive Demo**: Live demo interface for testing fine-tuned models

### 6. External Services
- **Hugging Face Hub**: Model and dataset storage and sharing
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **Trackio Spaces**: Experiment tracking and visualization

## Key Workflows

1. **Dataset Creation**: Users record audio or upload files, which are processed into JSONL format (see the sketch below)
2. **Model Training**: Datasets are fed into training scripts with experiment tracking
3. **Model Publishing**: Trained models are pushed to HF Hub with generated model cards
4. **Demo Deployment**: Automated deployment of interactive demos to HF Spaces
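
A quick way to sanity-check a dataset produced by workflow 1 before training. This uses the standard `datasets` API rather than the app's internal loader, and assumes the JSONL layout described in the docs:

```python
from datasets import load_dataset, Audio

# Load the JSONL file; each row has "audio_path" and "text".
ds = load_dataset("json", data_files="datasets/voxtral_user/data.jsonl", split="train")

# Decode the audio paths into 16 kHz waveforms, as the training pipeline expects.
ds = ds.rename_column("audio_path", "audio").cast_column("audio", Audio(sampling_rate=16000))
print(ds[0]["audio"]["sampling_rate"], ds[0]["text"])
```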

See also:
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
- [Deployment Pipeline](deployment-pipeline.md)
- [Data Flow](data-flow.md)
docs/data-flow.md
ADDED
@@ -0,0 +1,374 @@
# Data Flow

```mermaid
flowchart TD
    %% User Input Sources
    subgraph "User Input"
        MIC[Microphone Recording<br/>Raw audio + timestamps]
        FILE[File Upload<br/>WAV/FLAC files]
        TEXT[Manual Transcripts<br/>Text input]
        LANG[Language Selection<br/>25+ languages]
    end

    %% Data Processing Pipeline
    subgraph "Data Processing"
        AUDIO_PROC[Audio Processing<br/>Resampling to 16kHz<br/>Format conversion]
        TEXT_PROC[Text Processing<br/>Transcript validation<br/>Cleaning & formatting]
        JSONL_CONV["JSONL Conversion<br/>{'audio_path': '...', 'text': '...'}"]
    end

    %% Dataset Storage
    subgraph "Dataset Storage"
        LOCAL_DS[Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/]
        HF_DS[HF Hub Dataset<br/>username/dataset-name<br/>Public sharing]
    end

    %% Training Data Flow
    subgraph "Training Data Pipeline"
        DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
        AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
        TRAIN_SPLIT[Train Split<br/>train_dataset]
        EVAL_SPLIT[Eval Split<br/>eval_dataset]
    end

    %% Model Training
    subgraph "Model Training"
        COLLATOR[VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction]
        FORWARD[Forward Pass<br/>Audio → Features → Text]
        LOSS[Loss Calculation<br/>Masked LM loss]
        BACKWARD[Backward Pass<br/>Gradient computation]
        OPTIMIZE[Parameter Updates<br/>LoRA or full fine-tuning]
    end

    %% Training Outputs
    subgraph "Training Outputs"
        MODEL_FILES[Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json]
        TRAINING_LOGS[Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves]
        CHECKPOINTS[Checkpoints<br/>Intermediate models<br/>best model tracking]
    end

    %% Publishing Pipeline
    subgraph "Publishing Pipeline"
        HF_REPO[HF Repository<br/>username/model-name<br/>Model hosting]
        MODEL_CARD[Model Card<br/>README.md<br/>Training details<br/>Usage examples]
        METADATA[Training Metadata<br/>Config + results<br/>Performance metrics]
    end

    %% Demo Deployment
    subgraph "Demo Deployment"
        SPACE_REPO[HF Space Repository<br/>username/model-name-demo<br/>Demo hosting]
        DEMO_APP[Demo Application<br/>Gradio interface<br/>Real-time inference]
        ENV_VARS[Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets]
    end

    %% External Data Sources
    subgraph "External Data Sources"
        GRANARY[NVIDIA Granary<br/>Multilingual ASR data<br/>25+ languages]
        HF_COMM[HF Community Datasets<br/>Public ASR datasets<br/>Standard formats]
    end

    %% Data Flow Connections
    MIC --> AUDIO_PROC
    FILE --> AUDIO_PROC
    TEXT --> TEXT_PROC
    LANG --> TEXT_PROC

    AUDIO_PROC --> JSONL_CONV
    TEXT_PROC --> JSONL_CONV

    JSONL_CONV --> LOCAL_DS
    LOCAL_DS --> HF_DS

    LOCAL_DS --> DS_LOADER
    HF_DS --> DS_LOADER
    GRANARY --> DS_LOADER
    HF_COMM --> DS_LOADER

    DS_LOADER --> AUDIO_CAST
    AUDIO_CAST --> TRAIN_SPLIT
    AUDIO_CAST --> EVAL_SPLIT

    TRAIN_SPLIT --> COLLATOR
    EVAL_SPLIT --> COLLATOR

    COLLATOR --> FORWARD
    FORWARD --> LOSS
    LOSS --> BACKWARD
    BACKWARD --> OPTIMIZE

    OPTIMIZE --> MODEL_FILES
    OPTIMIZE --> TRAINING_LOGS
    OPTIMIZE --> CHECKPOINTS

    MODEL_FILES --> HF_REPO
    TRAINING_LOGS --> HF_REPO
    CHECKPOINTS --> HF_REPO

    HF_REPO --> MODEL_CARD
    TRAINING_LOGS --> MODEL_CARD

    MODEL_CARD --> SPACE_REPO
    HF_REPO --> SPACE_REPO
    ENV_VARS --> SPACE_REPO

    SPACE_REPO --> DEMO_APP

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
    classDef external fill:#efebe9,stroke:#5d4037,stroke-width:2px

    class MIC,FILE,TEXT,LANG input
    class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
    class LOCAL_DS,HF_DS storage
    class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
    class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
    class HF_REPO,MODEL_CARD,METADATA publishing
    class SPACE_REPO,DEMO_APP,ENV_VARS deployment
    class GRANARY,HF_COMM external
```

## Data Flow Overview

This diagram illustrates the complete data flow through the Voxtral ASR Fine-tuning application, from user input to deployed demo.

### Data Input Sources

#### User-Generated Data
- **Microphone Recording**: Raw audio captured through the browser microphone
- **File Upload**: Existing WAV/FLAC audio files
- **Manual Transcripts**: User-provided text transcriptions
- **Language Selection**: Influences phrase selection from NVIDIA Granary

#### External Data Sources
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **HF Community Datasets**: Public datasets from the Hugging Face Hub

### Data Processing Pipeline

#### Audio Processing
```python
import librosa
import soundfile as sf

# Resample to 16 kHz; librosa.load returns the waveform and its sampling rate.
audio, sr = librosa.load(audio_path, sr=16000)
# Write back as WAV for format consistency.
sf.write(output_path, audio, 16000)
```

#### Text Processing
```python
# Text cleaning and validation
text = text.strip()
# Basic validation (length, content checks)
assert len(text) > 0, "Empty transcription"
```

#### JSONL Conversion
```python
import json

# Standard format for all datasets
entry = {
    "audio_path": str(audio_file_path),
    "text": cleaned_transcription,
}
# Append to the JSONL file, one record per line
with open(jsonl_path, "a") as f:
    f.write(json.dumps(entry) + "\n")
```

### Dataset Storage

#### Local Storage Structure
```
datasets/voxtral_user/
├── data.jsonl            # Main dataset file
├── recorded_data.jsonl   # From recordings
└── wavs/                 # Audio files
    ├── recording_0000.wav
    ├── recording_0001.wav
    └── ...
```

#### HF Hub Storage
- **Public Datasets**: Shareable with the community
- **Version Control**: Dataset versioning and updates
- **Standard Metadata**: Automatic README generation

### Training Data Pipeline

#### Dataset Loading
```python
from datasets import load_dataset

# Load a local JSONL dataset
ds = _load_jsonl_dataset("datasets/voxtral_user/data.jsonl")

# Or load a dataset from the HF Hub
ds = load_dataset("username/dataset-name", split="train")
```

#### Audio Casting
```python
from datasets import Audio

# Ensure a consistent sampling rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```

#### Train/Eval Split
```python
# Create train and eval datasets
train_dataset = ds.select(range(train_count))
eval_dataset = ds.select(range(train_count, train_count + eval_count))
```

### Training Process Flow

#### Data Collation
- **VoxtralDataCollator**: Custom collator for the Voxtral model
- **Audio Processing**: Convert audio to model inputs
- **Prompt Construction**: Build `[AUDIO]...[AUDIO] <transcribe>` prompts
- **Text Tokenization**: Process transcription targets
- **Masking**: Mask audio prompt tokens during training (see the sketch below)
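
The real `VoxtralDataCollator` lives in the training scripts; the sketch below only illustrates the masking idea described above, with hypothetical `encode_prompt`/`encode_text` helpers standing in for the actual processor calls:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch, processor, pad_id=0, ignore_index=-100):
    """Tokenize prompt + target per example and mask the prompt in the labels."""
    input_ids, labels = [], []
    for example in batch:
        prompt_ids = processor.encode_prompt(example["audio"])  # hypothetical helper
        target_ids = processor.encode_text(example["text"])     # hypothetical helper
        input_ids.append(torch.tensor(prompt_ids + target_ids))
        # Loss is computed only on the transcript: prompt positions get ignore_index.
        labels.append(torch.tensor([ignore_index] * len(prompt_ids) + target_ids))
    return {
        "input_ids": pad_sequence(input_ids, batch_first=True, padding_value=pad_id),
        "labels": pad_sequence(labels, batch_first=True, padding_value=ignore_index),
    }
```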

#### Forward Pass
1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Extract audio features
3. **Language Model**: Generate the transcription autoregressively
4. **Loss Calculation**: Compare generated vs. target text

#### Backward Pass & Optimization
- **Gradient Computation**: Backpropagation
- **LoRA Updates**: Update only adapter parameters (LoRA mode)
- **Full Updates**: Update all parameters (full fine-tuning)
- **Optimizer Step**: Apply gradients with learning rate scheduling

### Training Outputs

#### Model Files
- **model.safetensors**: Model weights (safetensors format)
- **config.json**: Model configuration
- **tokenizer.json**: Tokenizer configuration
- **generation_config.json**: Generation parameters

#### Training Logs
- **train_results.json**: Final training metrics
- **eval_results.json**: Evaluation results
- **training_config.json**: Training hyperparameters
- **trainer_state.json**: Training state and checkpoints

#### Checkpoints
- **checkpoint-XXX/**: Intermediate model snapshots
- **best-model/**: Best performing model
- **final-model/**: Final trained model

### Publishing Pipeline

#### HF Repository Structure
```
username/model-name/
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── config.json
├── tokenizer.json
├── training_config.json
├── train_results.json
├── README.md (model card)
└── training_results/
    └── training.log
```

#### Model Card Generation
- **Template Processing**: Fill the model_card.md template
- **Variable Injection**: Training config, results, metadata
- **Conditional Sections**: Handle quantized models, etc.

### Demo Deployment

#### Space Repository Structure
```
username/model-name-demo/
├── app.py             # Gradio demo application
├── requirements.txt   # Python dependencies
├── README.md          # Space documentation
└── .env               # Environment variables
```

#### Environment Configuration
```bash
# Space environment variables
HF_MODEL_ID=username/model-name
MODEL_NAME="Voxtral Fine-tuned Model"
HF_TOKEN=read_only_token  # For model access
BRAND_OWNER_NAME=username
# ... other branding variables
```

### Data Flow Patterns

#### Streaming vs. Batch Processing
- **Training Data**: Batch processing for efficiency
- **External Datasets**: Streaming loading for memory efficiency
- **User Input**: Real-time processing with immediate feedback

#### Data Validation
- **Input Validation**: Check audio format, sampling rate, text length
- **Quality Assurance**: Filter out empty or invalid entries
- **Consistency Checks**: Ensure audio-text alignment

#### Error Handling
- **Graceful Degradation**: Fall back to local data if external sources fail
- **Retry Logic**: Automatic retry on network failures
- **Logging**: Comprehensive error logging and debugging

### Performance Considerations

#### Memory Management
- **Streaming Loading**: Process large datasets without loading everything at once
- **Audio Caching**: Cache processed audio features
- **Batch Optimization**: Balance batch size with available memory

#### Storage Optimization
- **Compression**: Use efficient audio formats
- **Deduplication**: Avoid duplicate data entries
- **Cleanup**: Remove temporary files after processing

#### Network Efficiency
- **Incremental Uploads**: Upload files as they're ready
- **Resume Capability**: Resume interrupted uploads
- **Caching**: Cache frequently accessed data

### Security & Privacy

#### Data Privacy
- **Local Processing**: Audio files processed locally when possible
- **User Consent**: Clear data usage policies
- **Anonymization**: Remove personally identifiable information

#### Access Control
- **Token Management**: Secure HF token storage
- **Repository Permissions**: Appropriate public/private settings
- **Rate Limiting**: Prevent abuse of demo interfaces

### Monitoring & Analytics

#### Data Quality Metrics
- **Audio Quality**: Sampling rate, format validation
- **Text Quality**: Length, language detection, consistency
- **Dataset Statistics**: Size, distribution, coverage

#### Performance Metrics
- **Processing Time**: Data loading, preprocessing, training time
- **Model Metrics**: Loss, perplexity, WER (if available)
- **Resource Usage**: Memory, CPU/GPU utilization

#### User Analytics
- **Usage Patterns**: Popular languages, dataset sizes
- **Success Rates**: Training completion, deployment success
- **Error Patterns**: Common failure modes and solutions

See also:
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
docs/deployment-pipeline.md
ADDED
@@ -0,0 +1,323 @@
# Deployment Pipeline

```mermaid
graph TB
    %% Input Sources
    subgraph "Inputs"
        TRAINED_MODEL[Trained Model<br/>Local directory]
        TRAINING_CONFIG[Training Config<br/>JSON/YAML]
        TRAINING_RESULTS[Training Results<br/>Metrics & logs]
        MODEL_METADATA[Model Metadata<br/>Name, description, etc.]
    end

    %% Model Publishing
    subgraph "Model Publishing"
        PUSH_SCRIPT[push_to_huggingface.py<br/>Model Publisher]

        subgraph "Publishing Steps"
            REPO_CREATION[Repository Creation<br/>HF Hub API]
            FILE_UPLOAD[File Upload<br/>Model files to HF]
            METADATA_UPLOAD[Metadata Upload<br/>Config & results]
        end
    end

    %% Model Card Generation
    subgraph "Model Card Generation"
        CARD_SCRIPT[generate_model_card.py<br/>Card Generator]

        subgraph "Card Components"
            TEMPLATE_LOAD[Template Loading<br/>model_card.md]
            VARIABLE_REPLACEMENT[Variable Replacement<br/>Config injection]
            CONDITIONAL_PROCESSING[Conditional Sections<br/>Quantized models, etc.]
        end
    end

    %% Demo Space Deployment
    subgraph "Demo Space Deployment"
        DEPLOY_SCRIPT[deploy_demo_space.py<br/>Space Deployer]

        subgraph "Space Setup"
            SPACE_CREATION[Space Repository<br/>Create HF Space]
            TEMPLATE_COPY[Template Copying<br/>demo_voxtral/ files]
            ENV_INJECTION[Environment Setup<br/>Model config injection]
            SECRET_SETUP[Secret Configuration<br/>HF_TOKEN, model vars]
        end
    end

    %% Space Building & Testing
    subgraph "Space Building"
        BUILD_TRIGGER[Build Trigger<br/>Automatic build start]
        DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
        MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
        APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
    end

    %% Live Demo
    subgraph "Live Demo Space"
        GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
        MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
        USER_INTERACTION[User Interaction<br/>Audio upload/playback]
    end

    %% External Services
    subgraph "External Services"
        HF_HUB[Hugging Face Hub<br/>Model & Space hosting]
        HF_SPACES[HF Spaces Platform<br/>Demo hosting]
    end

    %% Flow Connections
    TRAINED_MODEL --> PUSH_SCRIPT
    TRAINING_CONFIG --> PUSH_SCRIPT
    TRAINING_RESULTS --> PUSH_SCRIPT
    MODEL_METADATA --> PUSH_SCRIPT

    PUSH_SCRIPT --> REPO_CREATION
    REPO_CREATION --> FILE_UPLOAD
    FILE_UPLOAD --> METADATA_UPLOAD

    METADATA_UPLOAD --> CARD_SCRIPT
    TRAINING_CONFIG --> CARD_SCRIPT
    TRAINING_RESULTS --> CARD_SCRIPT

    CARD_SCRIPT --> TEMPLATE_LOAD
    TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
    VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING

    CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
    METADATA_UPLOAD --> DEPLOY_SCRIPT

    DEPLOY_SCRIPT --> SPACE_CREATION
    SPACE_CREATION --> TEMPLATE_COPY
    TEMPLATE_COPY --> ENV_INJECTION
    ENV_INJECTION --> SECRET_SETUP

    SECRET_SETUP --> BUILD_TRIGGER
    BUILD_TRIGGER --> DEPENDENCY_INSTALL
    DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
    MODEL_DOWNLOAD --> APP_INITIALIZATION

    APP_INITIALIZATION --> GRADIO_INTERFACE
    GRADIO_INTERFACE --> MODEL_INFERENCE
    MODEL_INFERENCE --> USER_INTERACTION

    HF_HUB --> MODEL_DOWNLOAD
    HF_SPACES --> GRADIO_INTERFACE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
    class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
    class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
    class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
    class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
    class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
    class HF_HUB,HF_SPACES external
```

## Deployment Pipeline Overview

This diagram illustrates the complete deployment pipeline that takes a trained Voxtral model and makes it available as an interactive demo on Hugging Face Spaces.

### Input Sources

#### Trained Model Artifacts
- **Model Files**: `model.safetensors`, `config.json`, `tokenizer.json`
- **Training Config**: Hyperparameters and training setup
- **Training Results**: Metrics, loss curves, evaluation results
- **Model Metadata**: Name, description, base model information

### Model Publishing Phase

#### push_to_huggingface.py Script
```python
# Initialize the publisher
pusher = HuggingFacePusher(
    model_path=output_dir,
    repo_name=repo_name,
    token=hf_token,
)

# Push the model
success = pusher.push_model(training_config, results)
```

#### Publishing Steps
1. **Repository Creation**: Create the HF Hub repository
2. **File Upload**: Upload all model files
3. **Metadata Upload**: Upload training config and results

### Model Card Generation

#### generate_model_card.py Script
```python
# Create the generator
generator = ModelCardGenerator()

# Generate the card
variables = {
    "model_name": model_name,
    "repo_name": repo_id,
    "base_model": base_model,
    # ... other variables
}
content = generator.generate_model_card(variables)
```

#### Card Processing
1. **Template Loading**: Load from `templates/model_card.md`
2. **Variable Replacement**: Inject actual values (see the sketch below)
3. **Conditional Processing**: Handle optional sections
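
As a rough illustration of step 2, assuming a `$`-placeholder template (the real template and its syntax live in `templates/model_card.md`):

```python
from pathlib import Path
from string import Template

# Hypothetical template content: "# $model_name\n\nBase model: $base_model\n..."
template = Template(Path("templates/model_card.md").read_text())

card = template.safe_substitute(
    model_name="Voxtral Fine-tuned Model",
    base_model="mistralai/Voxtral-Mini-3B-2507",
)
Path("README.md").write_text(card)  # becomes the repository's model card
```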

### Demo Space Deployment

#### deploy_demo_space.py Script
```python
# Initialize the deployer
deployer = DemoSpaceDeployer(
    hf_token=token,
    hf_username=username,
    model_id=model_id,
    demo_type="voxtral",
)

# Deploy the space
success = deployer.deploy()
```

#### Space Setup Process
1. **Space Creation**: Create the HF Space repository (see the sketch below)
2. **Template Copying**: Copy the demo template files
3. **Environment Injection**: Set model-specific variables
4. **Secret Configuration**: Configure HF_TOKEN and model variables
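
Steps 1-4 map naturally onto standard `huggingface_hub` calls; this sketch assumes the deployer uses them roughly as follows (ids and tokens are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_write_token")
repo_id = "username/model-name-demo"

# 1. Create a Gradio Space repository.
api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)

# 2-3. Upload the demo template files (app.py, requirements.txt, README.md).
api.upload_folder(folder_path="templates/spaces/demo_voxtral", repo_id=repo_id, repo_type="space")

# 4. Configure the variables and secrets the demo reads at runtime.
api.add_space_variable(repo_id, "HF_MODEL_ID", "username/model-name")
api.add_space_secret(repo_id, "HF_TOKEN", "hf_read_only_token")
```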

### Space Building Process

#### Automatic Build Trigger
- **Dependency Installation**: `pip install -r requirements.txt`
- **Model Download**: Download the model from the HF Hub
- **App Initialization**: Set up the Gradio application

#### Demo Template Structure
```
templates/spaces/demo_voxtral/
├── app.py             # Main Gradio application
├── requirements.txt   # Python dependencies
└── README.md          # Space documentation
```

### Live Demo Features

#### Gradio Interface
- **Audio Upload**: File upload or recording
- **Real-time Inference**: Live ASR transcription
- **Interactive Controls**: Model parameters, settings

#### Model Inference Pipeline
- **Audio Processing**: Convert audio to model inputs
- **Transcription Generation**: Run ASR inference
- **Result Display**: Show the transcription with confidence

### Configuration Management

#### Environment Variables
```python
import os

# Set in the Space secrets/environment
os.environ['HF_MODEL_ID'] = model_id
os.environ['MODEL_NAME'] = model_name
os.environ['HF_TOKEN'] = token  # For model access
```

#### Demo-Specific Settings
- **Model Configuration**: Base model, subfolder, quantization
- **UI Branding**: Custom titles, descriptions, links
- **Example Prompts**: Pre-configured demo examples

### Error Handling & Monitoring

#### Build Process Monitoring
- **Build Logs**: Real-time build status
- **Error Detection**: Failed dependency installation
- **Retry Logic**: Automatic rebuild on failure

#### Runtime Monitoring
- **Space Health**: Uptime and responsiveness
- **Model Loading**: Successful model initialization
- **Inference Errors**: Runtime error handling

### Security Considerations

#### Token Management
- **Read-Only Tokens**: Use read-only tokens for demo spaces
- **Secret Storage**: Secure storage of HF_TOKEN
- **Access Control**: Proper repository permissions

#### Resource Management
- **Memory Limits**: Space hardware constraints
- **Timeout Handling**: Inference timeout protection
- **Rate Limiting**: Prevent abuse

### Integration Points

#### With Training Scripts
- **Training Config**: Used for model card generation
- **Training Results**: Included in model metadata
- **Model Path**: Direct path to trained model files

#### With the Interface (interface.py)
- **Parameter Passing**: Deployment settings from the UI
- **Progress Updates**: Deployment progress shown to the user
- **Result Links**: Direct links to deployed spaces

### Deployment Workflows

#### Full Pipeline (Recommended)
1. Train model → generate model card → push to Hub → deploy demo
2. All steps automated through a single interface action
3. Comprehensive error handling and rollback

#### Manual Deployment
1. Use individual scripts for granular control
2. Custom configuration and branding
3. Debugging and troubleshooting capabilities

#### CI/CD Integration
- **Automated Triggers**: GitHub Actions integration
- **Version Control**: Model versioning and releases
- **Testing**: Automated demo testing

### Performance Optimization

#### Space Hardware Selection
- **CPU Basic**: Free tier, sufficient for small models
- **GPU Options**: For larger models requiring acceleration
- **Memory Scaling**: Based on model size requirements

#### Model Optimization
- **Quantization**: 4-bit quantization for a smaller footprint
- **Model Sharding**: Split large models across memory
- **Caching**: Model caching for faster cold starts

### Monitoring & Analytics

#### Space Analytics
- **Usage Metrics**: Daily active users, session duration
- **Performance Metrics**: Inference latency, error rates
- **User Feedback**: Demo effectiveness and issues

#### Model Analytics
- **Download Stats**: Model popularity and usage
- **Citation Tracking**: Academic and research usage
- **Community Feedback**: GitHub issues and discussions

See also:
- [Architecture Overview](architecture.md)
- [Training Pipeline](training-pipeline.md)
- [Data Flow](data-flow.md)
docs/diagrams.html
ADDED
@@ -0,0 +1,728 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Voxtral ASR Fine-tuning - Architecture Diagrams</title>
    <script type="module">
        import mermaid from 'https://cdn.jsdelivr.net/npm/[email protected]/dist/mermaid.esm.min.mjs';
        mermaid.initialize({
            startOnLoad: true,
            theme: 'base',
            themeVariables: {
                primaryColor: '#e3f2fd',
                primaryTextColor: '#1976d2',
                primaryBorderColor: '#01579b',
                lineColor: '#424242',
                secondaryColor: '#fff3e0',
                tertiaryColor: '#fce4ec',
                background: '#ffffff',
                mainBkg: '#ffffff',
                secondBkg: '#f5f5f5',
                textColor: '#333333'
            },
            flowchart: {
                useMaxWidth: true,
                htmlLabels: true,
                curve: 'basis'
            },
            sequence: {
                useMaxWidth: true
            }
        });
    </script>
    <style>
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            line-height: 1.6;
            color: #333;
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
            background: #f8f9fa;
        }

        .header {
            text-align: center;
            margin-bottom: 40px;
            padding: 20px;
            background: white;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }

        .diagram-container {
            background: white;
            margin: 20px 0;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }

        .diagram-title {
            font-size: 1.5em;
            font-weight: bold;
            margin-bottom: 15px;
            color: #1976d2;
            border-bottom: 2px solid #e3f2fd;
            padding-bottom: 10px;
        }

        .diagram-description {
            margin-bottom: 20px;
            color: #666;
            font-style: italic;
        }

        .navigation {
            position: fixed;
            top: 20px;
            right: 20px;
            background: white;
            padding: 15px;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
            max-width: 200px;
        }

        .nav-link {
            display: block;
            padding: 8px 0;
            color: #1976d2;
            text-decoration: none;
            border-bottom: 1px solid #eee;
        }

        .nav-link:hover {
            color: #01579b;
            text-decoration: underline;
        }

        .nav-link:last-child {
            border-bottom: none;
        }

        .code-toggle {
            background: #f5f5f5;
            border: 1px solid #ddd;
            padding: 10px;
            margin: 10px 0;
            border-radius: 4px;
            cursor: pointer;
            font-size: 0.9em;
        }

        .mermaid-code {
            display: none;
            background: #f8f9fa;
            border: 1px solid #dee2e6;
            border-radius: 4px;
            padding: 15px;
            margin: 10px 0;
            font-family: 'Courier New', monospace;
            font-size: 0.85em;
            white-space: pre-wrap;
            overflow-x: auto;
        }

        .download-btn {
            background: #1976d2;
            color: white;
            border: none;
            padding: 8px 16px;
            border-radius: 4px;
            cursor: pointer;
            font-size: 0.9em;
            margin: 10px 5px 10px 0;
        }

        .download-btn:hover {
            background: #01579b;
        }

        @media print {
            .navigation, .code-toggle, .download-btn {
                display: none;
            }
            .diagram-container {
                break-inside: avoid;
                margin: 10px 0;
            }
        }
    </style>
</head>
<body>
    <div class="header">
        <h1>Voxtral ASR Fine-tuning</h1>
        <h2>Architecture & Workflow Diagrams</h2>
        <p>Interactive documentation with Mermaid diagrams</p>
    </div>

    <nav class="navigation">
        <strong>Quick Navigation</strong>
        <a href="#overview" class="nav-link">Overview</a>
        <a href="#architecture" class="nav-link">Architecture</a>
        <a href="#interface" class="nav-link">Interface Workflow</a>
        <a href="#training" class="nav-link">Training Pipeline</a>
        <a href="#deployment" class="nav-link">Deployment Pipeline</a>
        <a href="#dataflow" class="nav-link">Data Flow</a>
    </nav>

    <div id="overview" class="diagram-container">
        <div class="diagram-title">Documentation Overview</div>
        <div class="diagram-description">
            High-level overview of the Voxtral ASR Fine-tuning application and its documentation structure.
        </div>
        <div class="mermaid">
graph TD
    START(["Voxtral ASR Fine-tuning App"]) --> OVERVIEW{Choose Documentation}

    OVERVIEW --> ARCH["Architecture Overview"]
    OVERVIEW --> WORKFLOW["Interface Workflow"]
    OVERVIEW --> TRAINING["Training Pipeline"]
    OVERVIEW --> DEPLOYMENT["Deployment Pipeline"]
    OVERVIEW --> DATAFLOW["Data Flow"]

    ARCH --> ARCH_DIAG["High-level Architecture<br/>System Components & Layers"]
    WORKFLOW --> WORKFLOW_DIAG["User Journey<br/>Recording → Training → Demo"]
    TRAINING --> TRAINING_DIAG["Training Scripts<br/>Data → Model → Results"]
    DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo<br/>Model → Hub → Space"]
    DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey<br/>Input → Processing → Output"]

    subgraph "Core Components"
        INTERFACE["interface.py<br/>Gradio Web UI"]
        TRAIN_SCRIPTS["scripts/train*.py<br/>Training Scripts"]
        DEPLOY_SCRIPT["scripts/deploy_demo_space.py<br/>Demo Deployment"]
        PUSH_SCRIPT["scripts/push_to_huggingface.py<br/>Model Publishing"]
    end

    subgraph "Key Data Formats"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        HFDATA["HF Hub Models<br/>username/model-name"]
        SPACES["HF Spaces<br/>Interactive Demos"]
    end

    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT

    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
        </div>
    </div>

    <div id="architecture" class="diagram-container">
        <div class="diagram-title">System Architecture</div>
        <div class="diagram-description">
            High-level architecture showing the main components and their relationships in the Voxtral ASR Fine-tuning application.
        </div>
        <div class="mermaid">
graph TB
    subgraph "User Interface"
        UI["Gradio Web Interface<br/>interface.py"]
        REC["Audio Recording<br/>Microphone Input"]
        UP["File Upload<br/>WAV/FLAC files"]
    end

    subgraph "Data Processing"
        DP["Data Processing<br/>Audio resampling<br/>JSONL creation"]
        DS["Dataset Management<br/>NVIDIA Granary<br/>Local datasets"]
    end

    subgraph "Training Pipeline"
        TF["Full Fine-tuning<br/>scripts/train.py"]
        TL["LoRA Fine-tuning<br/>scripts/train_lora.py"]
        TI["Trackio Integration<br/>Experiment Tracking"]
    end

    subgraph "Model Management"
        MM["Model Management<br/>Hugging Face Hub<br/>Local storage"]
        MC["Model Card Generation<br/>scripts/generate_model_card.py"]
|
255 |
+
end
|
256 |
+
|
257 |
+
subgraph "Deployment & Demo"
|
258 |
+
DEP["Demo Space Deployment<br/>scripts/deploy_demo_space.py"]
|
259 |
+
HF["HF Spaces<br/>Interactive Demo"]
|
260 |
+
end
|
261 |
+
|
262 |
+
subgraph "External Services"
|
263 |
+
HFH["Hugging Face Hub<br/>Models & Datasets"]
|
264 |
+
GRAN["NVIDIA Granary<br/>Multilingual ASR Dataset"]
|
265 |
+
TRACK["Trackio Spaces<br/>Experiment Tracking"]
|
266 |
+
end
|
267 |
+
|
268 |
+
UI --> DP
|
269 |
+
REC --> DP
|
270 |
+
UP --> DP
|
271 |
+
DP --> DS
|
272 |
+
|
273 |
+
DS --> TF
|
274 |
+
DS --> TL
|
275 |
+
TF --> TI
|
276 |
+
TL --> TI
|
277 |
+
|
278 |
+
TF --> MM
|
279 |
+
TL --> MM
|
280 |
+
MM --> MC
|
281 |
+
|
282 |
+
MM --> DEP
|
283 |
+
DEP --> HF
|
284 |
+
|
285 |
+
DS -.-> HFH
|
286 |
+
MM -.-> HFH
|
287 |
+
TI -.-> TRACK
|
288 |
+
DS -.-> GRAN
|
289 |
+
|
290 |
+
classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
|
291 |
+
classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
|
292 |
+
classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
|
293 |
+
classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
|
294 |
+
classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
|
295 |
+
classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
|
296 |
+
|
297 |
+
class UI,REC,UP interface
|
298 |
+
class DP,DS processing
|
299 |
+
class TF,TL,TI training
|
300 |
+
class MM,MC management
|
301 |
+
class DEP,HF deployment
|
302 |
+
class HFH,GRAN,TRACK external
|
303 |
+
</div>
|
304 |
+
</div>
|
305 |
+
|
306 |
+
<div id="interface" class="diagram-container">
|
307 |
+
<div class="diagram-title">Interface Workflow</div>
|
308 |
+
<div class="diagram-description">
|
309 |
+
Complete user journey through the Voxtral ASR Fine-tuning interface, from language selection to demo deployment.
|
310 |
+
</div>
|
311 |
+
<div class="mermaid">
|
312 |
+
flowchart TD
|
313 |
+
START(["User Opens Interface"]) --> LANG["Language Selection<br/>Choose from 25+ languages"]
|
314 |
+
LANG --> PHRASES["Load Phrases<br/>From NVIDIA Granary"]
|
315 |
+
PHRASES --> RECORD["Recording Interface<br/>Display phrases + audio recording"]
|
316 |
+
|
317 |
+
RECORD --> |User Records| PROCESS_REC["Process Recordings<br/>Save WAV files + transcripts"]
|
318 |
+
RECORD --> |Upload Files| PROCESS_UPLOAD["Process Uploads<br/>Handle existing files + transcripts"]
|
319 |
+
|
320 |
+
PROCESS_REC --> JSONL["Create JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
|
321 |
+
PROCESS_UPLOAD --> JSONL
|
322 |
+
|
323 |
+
JSONL --> CONFIG["Training Configuration<br/>Model, LoRA/full, hyperparameters"]
|
324 |
+
CONFIG --> TRAIN["Training Process<br/>Execute train.py or train_lora.py"]
|
325 |
+
|
326 |
+
TRAIN --> PUSH["Push to Hub<br/>Model + metadata to HF Hub"]
|
327 |
+
TRAIN --> CARD["Generate Model Card<br/>Automated documentation"]
|
328 |
+
PUSH --> DEPLOY["Deploy Demo Space<br/>Interactive demo on HF Spaces"]
|
329 |
+
|
330 |
+
DEPLOY --> END(["Demo Ready<br/>Interactive ASR Demo"])
|
331 |
+
|
332 |
+
PUSH -.-> END
|
333 |
+
CARD -.-> END
|
334 |
+
|
335 |
+
classDef start fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
|
336 |
+
classDef process fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
|
337 |
+
classDef decision fill:#fff3e0,stroke:#f57c00,stroke-width:2px
|
338 |
+
classDef terminal fill:#e8f5e8,stroke:#388e3c,stroke-width:3px
|
339 |
+
|
340 |
+
class START start
|
341 |
+
class END terminal
|
342 |
+
class LANG,PHRASES,RECORD,PROCESS_REC,PROCESS_UPLOAD,JSONL,CONFIG,TRAIN,PUSH,CARD,DEPLOY process
|
343 |
+
</div>
|
344 |
+
</div>
|
345 |
+
|
346 |
+
<div id="training" class="diagram-container">
|
347 |
+
<div class="diagram-title">Training Pipeline</div>
|
348 |
+
<div class="diagram-description">
|
349 |
+
Detailed training pipeline showing how data flows through training scripts and supporting infrastructure.
|
350 |
+
</div>
|
351 |
+
<div class="mermaid">
|
352 |
+
graph TB
|
353 |
+
subgraph "Data Sources"
|
354 |
+
JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
|
355 |
+
GRANARY["NVIDIA Granary Dataset<br/>Multilingual ASR Data"]
|
356 |
+
HFDATA["HF Hub Datasets<br/>Community Datasets"]
|
357 |
+
end
|
358 |
+
|
359 |
+
subgraph "Data Processing"
|
360 |
+
LOADER["Dataset Loader<br/>_load_jsonl_dataset()"]
|
361 |
+
CASTER["Audio Casting<br/>16kHz resampling"]
|
362 |
+
COLLATOR["VoxtralDataCollator<br/>Audio + Text Processing"]
|
363 |
+
end
|
364 |
+
|
365 |
+
subgraph "Training Scripts"
|
366 |
+
TRAIN_FULL["Full Fine-tuning<br/>scripts/train.py"]
|
367 |
+
TRAIN_LORA["LoRA Fine-tuning<br/>scripts/train_lora.py"]
|
368 |
+
|
369 |
+
subgraph "Training Components"
|
370 |
+
MODEL_INIT["Model Initialization<br/>VoxtralForConditionalGeneration"]
|
371 |
+
LORA_CONFIG["LoRA Configuration<br/>LoraConfig + get_peft_model"]
|
372 |
+
PROCESSOR_INIT["Processor Initialization<br/>VoxtralProcessor"]
|
373 |
+
end
|
374 |
+
end
|
375 |
+
|
376 |
+
subgraph "Training Infrastructure"
|
377 |
+
TRACKIO_INIT["Trackio Integration<br/>Experiment Tracking"]
|
378 |
+
HF_TRAINER["Hugging Face Trainer<br/>TrainingArguments + Trainer"]
|
379 |
+
TORCH_DEVICE["Torch Device Setup<br/>GPU/CPU Detection"]
|
380 |
+
end
|
381 |
+
|
382 |
+
subgraph "Training Process"
|
383 |
+
FORWARD_PASS["Forward Pass<br/>Audio Processing + Generation"]
|
384 |
+
LOSS_CALC["Loss Calculation<br/>Masked Language Modeling"]
|
385 |
+
BACKWARD_PASS["Backward Pass<br/>Gradient Computation"]
|
386 |
+
OPTIMIZER_STEP["Optimizer Step<br/>Parameter Updates"]
|
387 |
+
LOGGING["Metrics Logging<br/>Loss, Perplexity, etc."]
|
388 |
+
end
|
389 |
+
|
390 |
+
subgraph "Model Management"
|
391 |
+
CHECKPOINT_SAVING["Checkpoint Saving<br/>Model snapshots"]
|
392 |
+
MODEL_SAVING["Final Model Saving<br/>Processor + Model"]
|
393 |
+
LOCAL_STORAGE["Local Storage<br/>outputs/ directory"]
|
394 |
+
end
|
395 |
+
|
396 |
+
LOADER --> CASTER
|
397 |
+
CASTER --> COLLATOR
|
398 |
+
|
399 |
+
COLLATOR --> TRAIN_FULL
|
400 |
+
COLLATOR --> TRAIN_LORA
|
401 |
+
|
402 |
+
TRAIN_FULL --> MODEL_INIT
|
403 |
+
TRAIN_LORA --> MODEL_INIT
|
404 |
+
TRAIN_LORA --> LORA_CONFIG
|
405 |
+
|
406 |
+
MODEL_INIT --> PROCESSOR_INIT
|
407 |
+
LORA_CONFIG --> PROCESSOR_INIT
|
408 |
+
|
409 |
+
PROCESSOR_INIT --> TRACKIO_INIT
|
410 |
+
PROCESSOR_INIT --> HF_TRAINER
|
411 |
+
PROCESSOR_INIT --> TORCH_DEVICE
|
412 |
+
|
413 |
+
TRACKIO_INIT --> HF_TRAINER
|
414 |
+
TORCH_DEVICE --> HF_TRAINER
|
415 |
+
|
416 |
+
HF_TRAINER --> FORWARD_PASS
|
417 |
+
FORWARD_PASS --> LOSS_CALC
|
418 |
+
LOSS_CALC --> BACKWARD_PASS
|
419 |
+
BACKWARD_PASS --> OPTIMIZER_STEP
|
420 |
+
OPTIMIZER_STEP --> LOGGING
|
421 |
+
|
422 |
+
LOGGING --> CHECKPOINT_SAVING
|
423 |
+
LOGGING --> TRACKIO_INIT
|
424 |
+
|
425 |
+
HF_TRAINER --> MODEL_SAVING
|
426 |
+
MODEL_SAVING --> LOCAL_STORAGE
|
427 |
+
|
428 |
+
JSONL --> LOADER
|
429 |
+
GRANARY --> LOADER
|
430 |
+
HFDATA --> LOADER
|
431 |
+
|
432 |
+
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
|
433 |
+
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
|
434 |
+
classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
|
435 |
+
classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
|
436 |
+
classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
|
437 |
+
classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px
|
438 |
+
|
439 |
+
class JSONL,GRANARY,HFDATA input
|
440 |
+
class LOADER,CASTER,COLLATOR processing
|
441 |
+
class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
|
442 |
+
class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
|
443 |
+
class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
|
444 |
+
class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
|
445 |
+
</div>
|
446 |
+
</div>
|
447 |
+
|
448 |
+
<div id="deployment" class="diagram-container">
|
449 |
+
<div class="diagram-title">Deployment Pipeline</div>
|
450 |
+
<div class="diagram-description">
|
451 |
+
Model publishing and demo deployment process from trained model to live interactive demo.
|
452 |
+
</div>
|
453 |
+
<div class="mermaid">
|
454 |
+
graph TB
|
455 |
+
subgraph "Inputs"
|
456 |
+
TRAINED_MODEL["Trained Model<br/>Local directory"]
|
457 |
+
TRAINING_CONFIG["Training Config<br/>JSON/YAML"]
|
458 |
+
TRAINING_RESULTS["Training Results<br/>Metrics & logs"]
|
459 |
+
MODEL_METADATA["Model Metadata<br/>Name, description, etc."]
|
460 |
+
end
|
461 |
+
|
462 |
+
subgraph "Model Publishing"
|
463 |
+
PUSH_SCRIPT["push_to_huggingface.py<br/>Model Publisher"]
|
464 |
+
|
465 |
+
subgraph "Publishing Steps"
|
466 |
+
REPO_CREATION["Repository Creation<br/>HF Hub API"]
|
467 |
+
FILE_UPLOAD["File Upload<br/>Model files to HF"]
|
468 |
+
METADATA_UPLOAD["Metadata Upload<br/>Config & results"]
|
469 |
+
end
|
470 |
+
end
|
471 |
+
|
472 |
+
subgraph "Model Card Generation"
|
473 |
+
CARD_SCRIPT["generate_model_card.py<br/>Card Generator"]
|
474 |
+
|
475 |
+
subgraph "Card Components"
|
476 |
+
TEMPLATE_LOAD["Template Loading<br/>model_card.md"]
|
477 |
+
VARIABLE_REPLACEMENT["Variable Replacement<br/>Config injection"]
|
478 |
+
CONDITIONAL_PROCESSING["Conditional Sections<br/>Quantized models, etc."]
|
479 |
+
end
|
480 |
+
end
|
481 |
+
|
482 |
+
subgraph "Demo Space Deployment"
|
483 |
+
DEPLOY_SCRIPT["deploy_demo_space.py<br/>Space Deployer"]
|
484 |
+
|
485 |
+
subgraph "Space Setup"
|
486 |
+
SPACE_CREATION["Space Repository<br/>Create HF Space"]
|
487 |
+
TEMPLATE_COPY["Template Copying<br/>demo_voxtral/ files"]
|
488 |
+
ENV_INJECTION["Environment Setup<br/>Model config injection"]
|
489 |
+
SECRET_SETUP["Secret Configuration<br/>HF_TOKEN, model vars"]
|
490 |
+
end
|
491 |
+
end
|
492 |
+
|
493 |
+
subgraph "Space Building"
|
494 |
+
BUILD_TRIGGER[Build Trigger<br/>Automatic build start]
|
495 |
+
DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
|
496 |
+
MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
|
497 |
+
APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
|
498 |
+
end
|
499 |
+
|
500 |
+
subgraph "Live Demo Space"
|
501 |
+
GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
|
502 |
+
MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
|
503 |
+
USER_INTERACTION[User Interaction<br/>Audio upload/playback]
|
504 |
+
end
|
505 |
+
|
506 |
+
subgraph "External Services"
|
507 |
+
HF_HUB[Hugging Face Hub<br/>Model & Space hosting]
|
508 |
+
HF_SPACES[HF Spaces Platform<br/>Demo hosting]
|
509 |
+
end
|
510 |
+
|
511 |
+
TRAINED_MODEL --> PUSH_SCRIPT
|
512 |
+
TRAINING_CONFIG --> PUSH_SCRIPT
|
513 |
+
TRAINING_RESULTS --> PUSH_SCRIPT
|
514 |
+
MODEL_METADATA --> PUSH_SCRIPT
|
515 |
+
|
516 |
+
PUSH_SCRIPT --> REPO_CREATION
|
517 |
+
REPO_CREATION --> FILE_UPLOAD
|
518 |
+
FILE_UPLOAD --> METADATA_UPLOAD
|
519 |
+
|
520 |
+
METADATA_UPLOAD --> CARD_SCRIPT
|
521 |
+
TRAINING_CONFIG --> CARD_SCRIPT
|
522 |
+
TRAINING_RESULTS --> CARD_SCRIPT
|
523 |
+
|
524 |
+
CARD_SCRIPT --> TEMPLATE_LOAD
|
525 |
+
TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
|
526 |
+
VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING
|
527 |
+
|
528 |
+
CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
|
529 |
+
METADATA_UPLOAD --> DEPLOY_SCRIPT
|
530 |
+
|
531 |
+
DEPLOY_SCRIPT --> SPACE_CREATION
|
532 |
+
SPACE_CREATION --> TEMPLATE_COPY
|
533 |
+
TEMPLATE_COPY --> ENV_INJECTION
|
534 |
+
ENV_INJECTION --> SECRET_SETUP
|
535 |
+
|
536 |
+
SECRET_SETUP --> BUILD_TRIGGER
|
537 |
+
BUILD_TRIGGER --> DEPENDENCY_INSTALL
|
538 |
+
DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
|
539 |
+
MODEL_DOWNLOAD --> APP_INITIALIZATION
|
540 |
+
|
541 |
+
APP_INITIALIZATION --> GRADIO_INTERFACE
|
542 |
+
GRADIO_INTERFACE --> MODEL_INFERENCE
|
543 |
+
MODEL_INFERENCE --> USER_INTERACTION
|
544 |
+
|
545 |
+
HF_HUB --> MODEL_DOWNLOAD
|
546 |
+
HF_SPACES --> GRADIO_INTERFACE
|
547 |
+
|
548 |
+
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
|
549 |
+
classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
|
550 |
+
classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
|
551 |
+
classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
|
552 |
+
classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
|
553 |
+
classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
|
554 |
+
classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
|
555 |
+
|
556 |
+
class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
|
557 |
+
class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
|
558 |
+
class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
|
559 |
+
class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
|
560 |
+
class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
|
561 |
+
class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
|
562 |
+
class HF_HUB,HF_SPACES external
|
563 |
+
</div>
|
564 |
+
</div>
|
565 |
+
|
566 |
+
<div id="dataflow" class="diagram-container">
|
567 |
+
<div class="diagram-title">Data Flow</div>
|
568 |
+
<div class="diagram-description">
|
569 |
+
Complete data journey through the Voxtral ASR Fine-tuning application from user input to deployed demo.
|
570 |
+
</div>
|
571 |
+
<div class="mermaid">
|
572 |
+
flowchart TD
|
573 |
+
subgraph "User Input"
|
574 |
+
MIC["Microphone Recording<br/>Raw audio + timestamps"]
|
575 |
+
FILE["File Upload<br/>WAV/FLAC files"]
|
576 |
+
TEXT["Manual Transcripts<br/>Text input"]
|
577 |
+
LANG["Language Selection<br/>25+ languages"]
|
578 |
+
end
|
579 |
+
|
580 |
+
subgraph "Data Processing"
|
581 |
+
AUDIO_PROC["Audio Processing<br/>Resampling to 16kHz<br/>Format conversion"]
|
582 |
+
TEXT_PROC["Text Processing<br/>Transcript validation<br/>Cleaning & formatting"]
|
583 |
+
JSONL_CONV["JSONL Conversion<br/>{'audio_path': '...', 'text': '...'}"]
|
584 |
+
end
|
585 |
+
|
586 |
+
subgraph "Dataset Storage"
|
587 |
+
LOCAL_DS["Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/"]
|
588 |
+
HF_DS["HF Hub Dataset<br/>username/dataset-name<br/>Public sharing"]
|
589 |
+
end
|
590 |
+
|
591 |
+
subgraph "Training Data Pipeline"
|
592 |
+
DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
|
593 |
+
AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
|
594 |
+
TRAIN_SPLIT["Train Split<br/>train_dataset"]
|
595 |
+
EVAL_SPLIT["Eval Split<br/>eval_dataset"]
|
596 |
+
end
|
597 |
+
|
598 |
+
subgraph "Model Training"
|
599 |
+
COLLATOR["VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction"]
|
600 |
+
FORWARD["Forward Pass<br/>Audio β Features β Text"]
|
601 |
+
LOSS["Loss Calculation<br/>Masked LM loss"]
|
602 |
+
BACKWARD["Backward Pass<br/>Gradient computation"]
|
603 |
+
OPTIMIZE["Parameter Updates<br/>LoRA or full fine-tuning"]
|
604 |
+
end
|
605 |
+
|
606 |
+
subgraph "Training Outputs"
|
607 |
+
MODEL_FILES["Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json"]
|
608 |
+
TRAINING_LOGS["Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves"]
|
609 |
+
CHECKPOINTS["Checkpoints<br/>Intermediate models<br/>best model tracking"]
|
610 |
+
end
|
611 |
+
|
612 |
+
subgraph "Publishing Pipeline"
|
613 |
+
HF_REPO["HF Repository<br/>username/model-name<br/>Model hosting"]
|
614 |
+
MODEL_CARD["Model Card<br/>README.md<br/>Training details<br/>Usage examples"]
|
615 |
+
METADATA["Training Metadata<br/>Config + results<br/>Performance metrics"]
|
616 |
+
end
|
617 |
+
|
618 |
+
subgraph "Demo Deployment"
|
619 |
+
SPACE_REPO["HF Space Repository<br/>username/model-name-demo<br/>Demo hosting"]
|
620 |
+
DEMO_APP["Demo Application<br/>Gradio interface<br/>Real-time inference"]
|
621 |
+
ENV_VARS["Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets"]
|
622 |
+
end
|
623 |
+
|
624 |
+
MIC --> AUDIO_PROC
|
625 |
+
FILE --> AUDIO_PROC
|
626 |
+
TEXT --> TEXT_PROC
|
627 |
+
LANG --> TEXT_PROC
|
628 |
+
|
629 |
+
AUDIO_PROC --> JSONL_CONV
|
630 |
+
TEXT_PROC --> JSONL_CONV
|
631 |
+
|
632 |
+
JSONL_CONV --> LOCAL_DS
|
633 |
+
LOCAL_DS --> HF_DS
|
634 |
+
|
635 |
+
LOCAL_DS --> DS_LOADER
|
636 |
+
HF_DS --> DS_LOADER
|
637 |
+
|
638 |
+
DS_LOADER --> AUDIO_CAST
|
639 |
+
AUDIO_CAST --> TRAIN_SPLIT
|
640 |
+
AUDIO_CAST --> EVAL_SPLIT
|
641 |
+
|
642 |
+
TRAIN_SPLIT --> COLLATOR
|
643 |
+
EVAL_SPLIT --> COLLATOR
|
644 |
+
|
645 |
+
COLLATOR --> FORWARD
|
646 |
+
FORWARD --> LOSS
|
647 |
+
LOSS --> BACKWARD
|
648 |
+
BACKWARD --> OPTIMIZE
|
649 |
+
|
650 |
+
OPTIMIZE --> MODEL_FILES
|
651 |
+
OPTIMIZE --> TRAINING_LOGS
|
652 |
+
OPTIMIZE --> CHECKPOINTS
|
653 |
+
|
654 |
+
MODEL_FILES --> HF_REPO
|
655 |
+
TRAINING_LOGS --> HF_REPO
|
656 |
+
CHECKPOINTS --> HF_REPO
|
657 |
+
|
658 |
+
HF_REPO --> MODEL_CARD
|
659 |
+
TRAINING_LOGS --> MODEL_CARD
|
660 |
+
|
661 |
+
MODEL_CARD --> SPACE_REPO
|
662 |
+
HF_REPO --> SPACE_REPO
|
663 |
+
ENV_VARS --> SPACE_REPO
|
664 |
+
|
665 |
+
SPACE_REPO --> DEMO_APP
|
666 |
+
|
667 |
+
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
|
668 |
+
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
|
669 |
+
classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
|
670 |
+
classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
|
671 |
+
classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
|
672 |
+
classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
|
673 |
+
classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
|
674 |
+
|
675 |
+
class MIC,FILE,TEXT,LANG input
|
676 |
+
class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
|
677 |
+
class LOCAL_DS,HF_DS storage
|
678 |
+
class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
|
679 |
+
class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
|
680 |
+
class HF_REPO,MODEL_CARD,METADATA publishing
|
681 |
+
class SPACE_REPO,DEMO_APP,ENV_VARS deployment
|
682 |
+
</div>
|
683 |
+
</div>
|
684 |
+
|
685 |
+
<script>
|
686 |
+
// Toggle mermaid code visibility
|
687 |
+
function toggleCode(diagramId) {
|
688 |
+
const codeBlock = document.querySelector(`#${diagramId} .mermaid-code`);
|
689 |
+
if (codeBlock.style.display === 'none' || codeBlock.style.display === '') {
|
690 |
+
codeBlock.style.display = 'block';
|
691 |
+
} else {
|
692 |
+
codeBlock.style.display = 'none';
|
693 |
+
}
|
694 |
+
}
|
695 |
+
|
696 |
+
// Add toggle buttons to each diagram
|
697 |
+
document.addEventListener('DOMContentLoaded', function() {
|
698 |
+
const diagrams = document.querySelectorAll('.diagram-container');
|
699 |
+
diagrams.forEach((diagram, index) => {
|
700 |
+
const diagramId = diagram.id;
|
701 |
+
const mermaidDiv = diagram.querySelector('.mermaid');
|
702 |
+
|
703 |
+
if (mermaidDiv) {
|
704 |
+
// Create toggle button
|
705 |
+
const toggleBtn = document.createElement('button');
|
706 |
+
toggleBtn.className = 'code-toggle';
|
707 |
+
toggleBtn.textContent = 'π Show Mermaid Code';
|
708 |
+
toggleBtn.onclick = () => toggleCode(diagramId);
|
709 |
+
|
710 |
+
// Create code block
|
711 |
+
const codeBlock = document.createElement('pre');
|
712 |
+
codeBlock.className = 'mermaid-code';
|
713 |
+
codeBlock.textContent = mermaidDiv.textContent.trim();
|
714 |
+
|
715 |
+
// Insert elements
|
716 |
+
mermaidDiv.parentNode.insertBefore(toggleBtn, mermaidDiv);
|
717 |
+
mermaidDiv.parentNode.insertBefore(codeBlock, mermaidDiv.nextSibling);
|
718 |
+
}
|
719 |
+
});
|
720 |
+
});
|
721 |
+
|
722 |
+
// Print functionality
|
723 |
+
function printDiagrams() {
|
724 |
+
window.print();
|
725 |
+
}
|
726 |
+
</script>
|
727 |
+
</body>
|
728 |
+
</html>
|
docs/interface-workflow.md
ADDED
@@ -0,0 +1,173 @@
# Interface Workflow

```mermaid
stateDiagram-v2
    [*] --> LanguageSelection: User opens interface

    state "Language & Dataset Setup" as LangSetup {
        [*] --> LanguageSelection
        LanguageSelection --> LoadPhrases: Select language
        LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
        DisplayPhrases --> RecordingInterface: Show phrases & recording UI

        state RecordingInterface {
            [*] --> ShowInitialRows: Display first 10 phrases
            ShowInitialRows --> RecordAudio: User can record audio
            RecordAudio --> AddMoreRows: Optional - add 10 more rows
            AddMoreRows --> RecordAudio
        }
    }

    RecordingInterface --> DatasetCreation: User finishes recording

    state "Dataset Creation Options" as DatasetCreation {
        [*] --> FromRecordings: Create from recorded audio
        [*] --> FromUploads: Upload existing files

        FromRecordings --> ProcessRecordings: Save WAV files + transcripts
        FromUploads --> ProcessUploads: Process uploaded files + transcripts

        ProcessRecordings --> CreateJSONL: Generate JSONL dataset
        ProcessUploads --> CreateJSONL

        CreateJSONL --> DatasetReady: Dataset saved locally
    }

    DatasetCreation --> TrainingConfiguration: Dataset ready

    state "Training Setup" as TrainingConfiguration {
        [*] --> BasicSettings: Model, LoRA/full, batch size
        [*] --> AdvancedSettings: Learning rate, epochs, LoRA params

        BasicSettings --> ConfigureDeployment: Repo name, push options
        AdvancedSettings --> ConfigureDeployment

        ConfigureDeployment --> StartTraining: All settings configured
    }

    TrainingConfiguration --> TrainingProcess: Start training

    state "Training Process" as TrainingProcess {
        [*] --> InitializeTrackio: Setup experiment tracking
        InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
        RunTrainingScript --> StreamLogs: Show real-time training logs
        StreamLogs --> MonitorProgress: Track metrics & checkpoints

        MonitorProgress --> TrainingComplete: Training finished
        MonitorProgress --> HandleErrors: Training failed
        HandleErrors --> RetryOrExit: User can retry or exit
    }

    TrainingProcess --> PostTraining: Training complete

    state "Post-Training Actions" as PostTraining {
        [*] --> PushToHub: Push model to HF Hub
        [*] --> GenerateModelCard: Create model card
        [*] --> DeployDemoSpace: Deploy interactive demo

        PushToHub --> ModelPublished: Model available on HF Hub
        GenerateModelCard --> ModelDocumented: Model card created
        DeployDemoSpace --> DemoReady: Demo space deployed
    }

    PostTraining --> [*]: Process complete

    %% Alternative paths
    DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
    PushDatasetOnly --> DatasetPublished: Dataset on HF Hub

    %% Error handling
    TrainingProcess --> ErrorRecovery: Handle training errors
    ErrorRecovery --> RetryTraining: Retry with different settings
    RetryTraining --> TrainingConfiguration

    %% Styling and notes
    note right of LanguageSelection : User selects language for<br/>authentic phrases from<br/>NVIDIA Granary dataset
    note right of RecordingInterface : Users record themselves<br/>reading displayed phrases
    note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
    note right of TrainingConfiguration : Configure LoRA parameters,<br/>learning rate, epochs, etc.
    note right of TrainingProcess : Real-time log streaming<br/>with Trackio integration
    note right of PostTraining : Automated deployment<br/>pipeline
```

## Interface Workflow Overview

This diagram illustrates the complete user journey through the Voxtral ASR Fine-tuning interface. The workflow is designed to be intuitive and to guide users through each step of the fine-tuning process.

### Key Workflow Stages

#### 1. Language & Dataset Setup
- **Language Selection**: Users choose from 25+ European languages supported by NVIDIA Granary
- **Phrase Loading**: System loads authentic, high-quality phrases in the selected language (a loading sketch follows this list)
- **Recording Interface**: Dynamic interface showing phrases with audio recording components
- **Progressive Disclosure**: Users can add more rows as needed (up to 100 recordings)

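For orientation, here is a minimal sketch of how phrases could be streamed for the recording UI. The dataset id `nvidia/Granary`, the config name, and the `text` column are assumptions for illustration, not the interface's actual loader:

```python
# Hypothetical sketch: stream a handful of phrases for the recording UI.
# The dataset id, config name, and column name below are assumptions.
from datasets import load_dataset

def load_phrases(language_config: str, n: int = 10) -> list[str]:
    # Streaming avoids downloading the full corpus just to show a few phrases.
    ds = load_dataset("nvidia/Granary", language_config, split="train", streaming=True)
    phrases = []
    for row in ds:
        phrases.append(row["text"])
        if len(phrases) >= n:
            break
    return phrases
```
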
#### 2. Dataset Creation
- **From Recordings**: Process microphone recordings into WAV files and a JSONL dataset
- **From Uploads**: Handle existing WAV/FLAC files with manual transcripts
- **JSONL Format**: Standard format with `audio_path` and `text` fields (see the sketch after this list)
- **Local Storage**: Datasets stored in the `datasets/voxtral_user/` directory

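Because this JSONL layout feeds everything downstream, a minimal sketch of writing it may help; the `data.jsonl` file name and `datasets/voxtral_user/` directory follow the description above, while the helper itself is illustrative rather than the interface's actual code:

```python
# Illustrative helper: write (audio_path, text) pairs as a JSONL dataset.
import json
from pathlib import Path

def write_jsonl(samples: list[dict], root: str = "datasets/voxtral_user") -> Path:
    out = Path(root)
    out.mkdir(parents=True, exist_ok=True)
    jsonl_path = out / "data.jsonl"
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for s in samples:
            # One line per sample: an audio file path and its transcript.
            f.write(json.dumps({"audio_path": s["audio_path"], "text": s["text"]}) + "\n")
    return jsonl_path
```
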
#### 3. Training Configuration
- **Basic Settings**: Model selection, LoRA vs. full fine-tuning, batch size
- **Advanced Settings**: Learning rate, epochs, gradient accumulation
- **LoRA Parameters**: r, alpha, dropout, audio tower freezing options
- **Repository Setup**: Model naming and Hugging Face Hub integration

#### 4. Training Process
- **Trackio Integration**: Automatic experiment tracking setup
- **Script Execution**: Calls the appropriate training script (`train.py` or `train_lora.py`)
- **Log Streaming**: Real-time display of training progress and metrics
- **Error Handling**: Graceful handling of training failures with retry options

#### 5. Post-Training Actions
- **Model Publishing**: Automatic push to the Hugging Face Hub
- **Model Card Generation**: Automated creation using `generate_model_card.py`
- **Demo Deployment**: One-click deployment of interactive demo spaces

### Alternative Paths

#### Dataset-Only Workflow
- Users can create and publish datasets without training models
- Useful for dataset curation and sharing

#### Error Recovery
- Training failures trigger error recovery flows
- Users can retry with modified parameters
- Comprehensive error logging and debugging information

### Technical Integration Points

#### External Services
- **NVIDIA Granary**: Source of high-quality multilingual ASR data
- **Hugging Face Hub**: Model and dataset storage and sharing
- **Trackio Spaces**: Experiment tracking and visualization

#### Script Integration
- **interface.py**: Main Gradio application orchestrating the workflow
- **train.py/train_lora.py**: Core training scripts with Trackio integration
- **push_to_huggingface.py**: Model/dataset publishing
- **deploy_demo_space.py**: Automated demo deployment
- **generate_model_card.py**: Model documentation generation

### User Experience Features

#### Progressive Interface Reveal
- Interface components are revealed as users progress through the workflow
- Reduces cognitive load and guides users step by step

#### Real-time Feedback
- Live log streaming during training
- Progress indicators and status updates
- Immediate feedback on dataset creation and validation

#### Flexible Input Methods
- Support for both live recording and file uploads
- Multiple language options for diverse user needs
- Scalable recording interface (10-100 samples)

See also:
- [Architecture Overview](architecture.md)
- [Training Pipeline](training-pipeline.md)
- [Data Flow](data-flow.md)
docs/training-pipeline.md
ADDED
@@ -0,0 +1,271 @@
# Training Pipeline

```mermaid
graph TB
    %% Input Data Sources
    subgraph "Data Sources"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        GRANARY["NVIDIA Granary Dataset<br/>Multilingual ASR Data"]
        HFDATA["HF Hub Datasets<br/>Community Datasets"]
    end

    %% Data Processing
    subgraph "Data Processing"
        LOADER["Dataset Loader<br/>_load_jsonl_dataset()"]
        CASTER["Audio Casting<br/>16kHz resampling"]
        COLLATOR["VoxtralDataCollator<br/>Audio + Text Processing"]
    end

    %% Training Scripts
    subgraph "Training Scripts"
        TRAIN_FULL["Full Fine-tuning<br/>scripts/train.py"]
        TRAIN_LORA["LoRA Fine-tuning<br/>scripts/train_lora.py"]

        subgraph "Training Components"
            MODEL_INIT["Model Initialization<br/>VoxtralForConditionalGeneration"]
            LORA_CONFIG["LoRA Configuration<br/>LoraConfig + get_peft_model"]
            PROCESSOR_INIT["Processor Initialization<br/>VoxtralProcessor"]
        end
    end

    %% Training Infrastructure
    subgraph "Training Infrastructure"
        TRACKIO_INIT["Trackio Integration<br/>Experiment Tracking"]
        HF_TRAINER["Hugging Face Trainer<br/>TrainingArguments + Trainer"]
        TORCH_DEVICE["Torch Device Setup<br/>GPU/CPU Detection"]
    end

    %% Training Process
    subgraph "Training Process"
        FORWARD_PASS["Forward Pass<br/>Audio Processing + Generation"]
        LOSS_CALC["Loss Calculation<br/>Masked Language Modeling"]
        BACKWARD_PASS["Backward Pass<br/>Gradient Computation"]
        OPTIMIZER_STEP["Optimizer Step<br/>Parameter Updates"]
        LOGGING["Metrics Logging<br/>Loss, Perplexity, etc."]
    end

    %% Model Management
    subgraph "Model Management"
        CHECKPOINT_SAVING["Checkpoint Saving<br/>Model snapshots"]
        MODEL_SAVING["Final Model Saving<br/>Processor + Model"]
        LOCAL_STORAGE["Local Storage<br/>outputs/ directory"]
    end

    %% Flow Connections
    JSONL --> LOADER
    GRANARY --> LOADER
    HFDATA --> LOADER

    LOADER --> CASTER
    CASTER --> COLLATOR

    COLLATOR --> TRAIN_FULL
    COLLATOR --> TRAIN_LORA

    TRAIN_FULL --> MODEL_INIT
    TRAIN_LORA --> MODEL_INIT
    TRAIN_LORA --> LORA_CONFIG

    MODEL_INIT --> PROCESSOR_INIT
    LORA_CONFIG --> PROCESSOR_INIT

    PROCESSOR_INIT --> TRACKIO_INIT
    PROCESSOR_INIT --> HF_TRAINER
    PROCESSOR_INIT --> TORCH_DEVICE

    TRACKIO_INIT --> HF_TRAINER
    TORCH_DEVICE --> HF_TRAINER

    HF_TRAINER --> FORWARD_PASS
    FORWARD_PASS --> LOSS_CALC
    LOSS_CALC --> BACKWARD_PASS
    BACKWARD_PASS --> OPTIMIZER_STEP
    OPTIMIZER_STEP --> LOGGING

    LOGGING --> CHECKPOINT_SAVING
    LOGGING --> TRACKIO_INIT

    HF_TRAINER --> MODEL_SAVING
    MODEL_SAVING --> LOCAL_STORAGE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class JSONL,GRANARY,HFDATA input
    class LOADER,CASTER,COLLATOR processing
    class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
    class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
    class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
    class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
```

## Training Pipeline Overview

This diagram illustrates the complete training pipeline for Voxtral ASR fine-tuning, showing how data flows through the training scripts and supporting infrastructure.

### Data Input Sources

#### JSONL Datasets
- **Local Datasets**: User-created datasets from recordings or uploads
- **Format**: `{"audio_path": "path/to/audio.wav", "text": "transcription"}`
- **Processing**: Loaded via the `_load_jsonl_dataset()` function

#### NVIDIA Granary Dataset
- **Multilingual Support**: 25+ European languages
- **High Quality**: Curated ASR training data
- **Streaming**: Efficient loading without a full download

#### Hugging Face Hub Datasets
- **Community Datasets**: Public datasets from the HF Hub
- **Standard Formats**: Compatible with Voxtral training requirements

### Data Processing Pipeline

#### Dataset Loading
```python
# Load a local JSONL or HF dataset
ds = _load_jsonl_dataset(jsonl_path)
# or
ds = load_dataset(ds_name, ds_cfg, split="test")
```

#### Audio Processing
```python
# Cast to Audio format with 16kHz resampling
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```

#### Data Collation
- **VoxtralDataCollator**: Custom collator for Voxtral training
- **Audio Processing**: Converts audio to model inputs
- **Text Tokenization**: Processes transcription text
- **Masking**: Masks prompt tokens during training (a minimal sketch follows this list)

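The masking step is the heart of the collator: prompt and audio tokens are excluded from the loss so the model is only trained to predict the transcription. A minimal sketch of the idea, with illustrative names rather than the actual `VoxtralDataCollator` code:

```python
# Sketch: mask prompt tokens so only transcription tokens are scored.
import torch

def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100  # -100 is ignored by PyTorch cross-entropy
    return labels
```
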
### Training Script Architecture

#### Full Fine-tuning (`train.py`)
- **Complete Model Updates**: All parameters are trainable
- **Higher Memory Requirements**: Full model in memory
- **Better Convergence**: Can achieve higher accuracy

#### LoRA Fine-tuning (`train_lora.py`)
- **Parameter Efficient**: Only LoRA adapters are trained
- **Lower Memory Usage**: Base model stays frozen
- **Faster Training**: Fewer parameters to update
- **Configurable**: `r`, `alpha`, and `dropout` parameters (see the sketch after this list)

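A minimal sketch of the LoRA setup with `peft`; the model id, parameter values, and target modules here are assumptions for illustration, not necessarily the defaults used by `train_lora.py`:

```python
# Illustrative LoRA configuration; values and target modules are assumed.
from transformers import VoxtralForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_config)  # wraps the frozen base model
model.print_trainable_parameters()          # adapters are a small fraction of the total
```
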
### Training Infrastructure

#### Trackio Integration
```python
trackio.init(
    project="voxtral-finetuning",
    config={...},  # Training parameters
    space_id=trackio_space
)
```

#### Hugging Face Trainer
```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    learning_rate=learning_rate,
    num_train_epochs=epochs,
    bf16=True,  # BFloat16 for efficiency
    report_to=["trackio"],
    # ... other args
)
```

#### Device Management
- **GPU Detection**: Automatic CUDA/GPU detection (see the snippet after this list)
- **Fallback**: CPU training if no GPU is available
- **Memory Optimization**: Model sharding and gradient checkpointing

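A simplified version of the device setup the scripts perform:

```python
# Pick the GPU when available, otherwise fall back to CPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
```
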
### Training Process Flow

#### Forward Pass
1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Audio feature extraction
3. **Text Generation**: Autoregressive text generation from audio features

#### Loss Calculation
- **Masked Language Modeling**: Only transcription tokens contribute to the loss
- **Audio Prompt Masking**: Audio processing tokens are masked out
- **Cross-Entropy Loss**: Standard language modeling loss

#### Backward Pass & Optimization
- **Gradient Computation**: Backpropagation through the model
- **LoRA Updates**: Only adapter parameters updated (LoRA mode)
- **Full Updates**: All parameters updated (full fine-tuning)

### Model Management

#### Checkpoint Saving
- **Regular Checkpoints**: Saved every N steps
- **Best Model Tracking**: Save the best model based on validation loss
- **Resume Capability**: Continue training from checkpoints

#### Final Model Saving
```python
trainer.save_model()  # Saves model and tokenizer
processor.save_pretrained(output_dir)  # Saves processor
```

#### Local Storage Structure
```
outputs/
└── voxtral-finetuned-{timestamp}/
    ├── config.json
    ├── model.safetensors
    ├── tokenizer.json
    ├── training_config.json
    ├── train_results.json
    └── eval_results.json
```

### Integration Points

#### With Interface (`interface.py`)
- **Parameter Passing**: Training parameters from the UI
- **Log Streaming**: Real-time training logs to the UI
- **Progress Monitoring**: Training progress updates

#### With Model Publishing (`push_to_huggingface.py`)
- **Model Upload**: Trained model to the HF Hub
- **Metadata**: Training config and results
- **Model Cards**: Automatic model card generation

#### With Demo Deployment (`deploy_demo_space.py`)
- **Space Creation**: HF Spaces for demos
- **Model Integration**: Deploy the trained model in the demo
- **Configuration**: Demo-specific settings

### Performance Considerations

#### Memory Optimization
- **LoRA**: Significantly reduces memory requirements
- **Gradient Checkpointing**: Trades compute for memory
- **Mixed Precision**: BF16/FP16 training

#### Training Efficiency
- **Batch Size**: Balanced with gradient accumulation
- **Learning Rate**: Warmup and decay schedules
- **Early Stopping**: Prevents overfitting

#### Monitoring & Debugging
- **Metrics Tracking**: Loss, perplexity, learning rate
- **GPU Utilization**: Memory and compute monitoring
- **Error Handling**: Graceful failure recovery

See also:
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Data Flow](data-flow.md)
scripts/generate_svgs.py
ADDED
@@ -0,0 +1,135 @@
#!/usr/bin/env python3
"""
Generate SVG versions of Mermaid diagrams for documentation
"""

import re
import requests
from pathlib import Path
from typing import Optional

class MermaidToSVGConverter:
    """Convert Mermaid diagrams to SVG format"""

    def __init__(self):
        self.mermaid_api_url = "https://mermaid.ink/img/"

    def extract_mermaid_code(self, markdown_file: Path) -> Optional[str]:
        """Extract Mermaid code from a Markdown file"""
        try:
            with open(markdown_file, 'r', encoding='utf-8') as f:
                content = f.read()

            # Find Mermaid code blocks
            mermaid_pattern = r'```mermaid\s*\n(.*?)\n```'
            match = re.search(mermaid_pattern, content, re.DOTALL)

            if match:
                return match.group(1).strip()
            else:
                print(f"No Mermaid diagram found in {markdown_file}")
                return None

        except Exception as e:
            print(f"Error reading {markdown_file}: {e}")
            return None

    def convert_to_svg(self, mermaid_code: str, output_path: Path) -> bool:
        """Convert Mermaid code to SVG using the mermaid.ink service"""
        try:
            # Encode the Mermaid code for the URL
            import base64
            import urllib.parse

            # Prepend a theme init directive in the format expected by mermaid.ink
            mermaid_data = f"%%{{init: {{'theme': 'base', 'themeVariables': {{'primaryColor': '#e3f2fd', 'primaryTextColor': '#1976d2', 'primaryBorderColor': '#01579b', 'lineColor': '#424242', 'secondaryColor': '#fff3e0', 'tertiaryColor': '#fce4ec'}}}}}}%%\n{mermaid_code}"

            # Base64-encode the mermaid code
            encoded = base64.b64encode(mermaid_data.encode('utf-8')).decode('utf-8')
            url_encoded = urllib.parse.quote(encoded)

            # Create the full URL
            full_url = f"{self.mermaid_api_url}{url_encoded}"

            # Make the request
            response = requests.get(full_url, timeout=30)

            if response.status_code == 200:
                # Save the SVG
                with open(output_path, 'wb') as f:
                    f.write(response.content)
                print(f"✅ Generated SVG: {output_path}")
                return True
            else:
                print(f"❌ Failed to generate SVG for {output_path}: HTTP {response.status_code}")
                return False

        except Exception as e:
            print(f"❌ Error generating SVG for {output_path}: {e}")
            return False

    def process_markdown_file(self, markdown_file: Path, output_dir: Path) -> bool:
        """Process a single Markdown file and generate its SVG"""
        # Extract Mermaid code
        mermaid_code = self.extract_mermaid_code(markdown_file)
        if not mermaid_code:
            return False

        # Create output filename
        svg_filename = markdown_file.stem + ".svg"
        output_path = output_dir / svg_filename

        # Convert to SVG
        return self.convert_to_svg(mermaid_code, output_path)

def main():
    """Main function to generate SVGs for all documentation files"""
    print("Generating SVG versions of documentation diagrams...")

    # Setup paths
    docs_dir = Path(__file__).parent.parent / "docs"
    svgs_dir = docs_dir / "svgs"

    # Create SVGs directory
    svgs_dir.mkdir(exist_ok=True)

    # Initialize converter
    converter = MermaidToSVGConverter()

    # Process all Markdown files in the docs directory
    markdown_files = [
        "README.md",
        "architecture.md",
        "interface-workflow.md",
        "training-pipeline.md",
        "deployment-pipeline.md",
        "data-flow.md"
    ]

    success_count = 0
    total_count = len(markdown_files)

    for filename in markdown_files:
        markdown_path = docs_dir / filename
        if markdown_path.exists():
            print(f"\nProcessing {filename}...")
            if converter.process_markdown_file(markdown_path, svgs_dir):
                success_count += 1
        else:
            print(f"⚠️ File not found: {markdown_path}")

    print(f"\nSVG generation complete!")
    print(f"✅ Successfully generated: {success_count}/{total_count} SVGs")
    print(f"SVGs saved to: {svgs_dir}")

    if success_count < total_count:
        print(f"❌ Failed to generate: {total_count - success_count} SVGs")
        return 1

    return 0

if __name__ == "__main__":
    exit(main())
scripts/validate_mermaid.py
ADDED
@@ -0,0 +1,73 @@
#!/usr/bin/env python3
"""
Validate Mermaid syntax in HTML documentation
"""

import re

def validate_mermaid_html(html_file):
    """Validate Mermaid diagrams in an HTML file"""
    print(f"Validating Mermaid syntax in {html_file}")

    with open(html_file, 'r', encoding='utf-8') as f:
        content = f.read()

    # Find all Mermaid blocks
    mermaid_pattern = r'<div class="mermaid">(.*?)</div>'
    mermaid_blocks = re.findall(mermaid_pattern, content, re.DOTALL)

    print(f"Found {len(mermaid_blocks)} Mermaid blocks")

    issues = []

    # Check each Mermaid block
    for i, block in enumerate(mermaid_blocks):
        lines = block.strip().split('\n')
        if not lines or not lines[0].strip():
            issues.append(f"Block {i+1}: Empty Mermaid block")
            continue

        first_line = lines[0].strip()

        # Check if it starts with a valid diagram type
        valid_starts = [
            'graph', 'flowchart', 'stateDiagram', 'sequenceDiagram',
            'classDiagram', 'erDiagram', 'journey', 'gantt', 'pie',
            'gitgraph', 'mindmap', 'timeline', 'sankey'
        ]

        if not any(first_line.startswith(start) for start in valid_starts):
            issues.append(f"Block {i+1}: Invalid diagram type start - '{first_line}'")

        # Check for classDef/class consistency
        if 'classDef' in block:
            class_statements = len(re.findall(r'^\s*class\s+', block, re.MULTILINE))
            if class_statements == 0:
                issues.append(f"Block {i+1}: classDef defined but no class statements found")

        # Check for basic syntax issues
        if block.count('[') != block.count(']'):
            issues.append(f"Block {i+1}: Unmatched square brackets")

        if block.count('(') != block.count(')'):
            issues.append(f"Block {i+1}: Unmatched parentheses")

        if 'subgraph' in block:
            subgraph_count = block.count('subgraph')
            # Count only standalone `end` keywords so that words containing
            # "end" inside node labels (e.g. "Dependency") are not miscounted.
            end_count = len(re.findall(r'^\s*end\s*$', block, re.MULTILINE))
            if subgraph_count != end_count:
                issues.append(f"Block {i+1}: Unmatched subgraph/end blocks ({subgraph_count} vs {end_count})")

    # Report results
    print("\nValidation Results:")
    if issues:
        print("❌ Issues found:")
        for issue in issues:
            print(f" - {issue}")
        return False
    else:
        print("✅ No syntax issues found!")
        return True

if __name__ == "__main__":
    validate_mermaid_html("docs/diagrams.html")