Joseph Pollack committed: improves demo for automatic deployment and interface linking to deployment scripts

Files changed:

- docs/blog-accessibility.md +497 -0
- interface.py +12 -12
- scripts/deploy_demo_space.py +74 -81
- scripts/push_to_huggingface.py +93 -111
- templates/model_card.md +1 -3
- templates/spaces/demo_voxtral/README.md +15 -3
- templates/spaces/demo_voxtral/app.py +82 -15
docs/blog-accessibility.md
ADDED
@@ -0,0 +1,497 @@
# Accessible Speech Recognition: Fine‑tune Voxtral on Your Own Voice

Building speech technology that understands everyone is an accessibility imperative. If you have a speech impediment (e.g., stutter, dysarthria, apraxia) or a heavy accent, mainstream ASR systems can struggle. This app lets you fine‑tune the Voxtral ASR model on your own voice so it adapts to your unique speaking style — improving recognition accuracy and unlocking more inclusive voice experiences.

## Who this helps

- **People with speech differences**: Personalized models that reduce error rates on your voice
- **Accented speakers**: Adapt Voxtral to your accent and vocabulary
- **Educators/clinicians**: Create tailored recognition models for communication support
- **Product teams**: Prototype inclusive voice features with real users quickly

## What you get

- **Record or upload audio** and create a JSONL dataset in a few clicks
- **One‑click training** with full fine‑tuning or LoRA for efficiency
- **Automatic publishing** to Hugging Face Hub with a generated model card
- **Instant demo deployment** to HF Spaces for shareable, live ASR

## How it works (at a glance)

```mermaid
graph TD
    %% Main Entry Point
    START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}

    %% Documentation Categories
    OVERVIEW --> ARCH[🏗️ Architecture Overview]
    OVERVIEW --> WORKFLOW[🔄 Interface Workflow]
    OVERVIEW --> TRAINING[🚀 Training Pipeline]
    OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
    OVERVIEW --> DATAFLOW[📊 Data Flow]

    %% Architecture Section
    ARCH --> ARCH_DIAG[High-level Architecture<br/>System Components & Layers]
    ARCH --> ARCH_LINK[📄 View Details →](architecture.md)

    %% Interface Section
    WORKFLOW --> WORKFLOW_DIAG[User Journey<br/>Recording → Training → Demo]
    WORKFLOW --> WORKFLOW_LINK[📄 View Details →](interface-workflow.md)

    %% Training Section
    TRAINING --> TRAINING_DIAG[Training Scripts<br/>Data → Model → Results]
    TRAINING --> TRAINING_LINK[📄 View Details →](training-pipeline.md)

    %% Deployment Section
    DEPLOYMENT --> DEPLOYMENT_DIAG[Publishing & Demo<br/>Model → Hub → Space]
    DEPLOYMENT --> DEPLOYMENT_LINK[📄 View Details →](deployment-pipeline.md)

    %% Data Flow Section
    DATAFLOW --> DATAFLOW_DIAG[Complete Data Journey<br/>Input → Processing → Output]
    DATAFLOW --> DATAFLOW_LINK[📄 View Details →](data-flow.md)

    %% Key Components Highlight
    subgraph "🎛️ Core Components"
        INTERFACE[interface.py<br/>Gradio Web UI]
        TRAIN_SCRIPTS[scripts/train*.py<br/>Training Scripts]
        DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
        PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
    end

    %% Data Flow Highlight
    subgraph "📁 Key Data Formats"
        JSONL[JSONL Dataset<br/>{"audio_path": "...", "text": "..."}]
        HFDATA[HF Hub Models<br/>username/model-name]
        SPACES[HF Spaces<br/>Interactive Demos]
    end

    %% Connect components to their respective docs
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT

    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    %% Styling
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
```

See the interactive diagram page for printing and quick navigation: [Interactive diagrams](diagrams.html).

## Quick start

### 1) Install

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

Use UV (recommended) or pip.

```bash
# UV
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt

# or pip
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### 2) Launch the interface

```bash
python interface.py
```

The Gradio app guides you through language selection, recording or uploading audio, dataset creation, and training.

## Create your voice dataset (UI)

```mermaid
stateDiagram-v2
    [*] --> LanguageSelection: User opens interface

    state "Language & Dataset Setup" as LangSetup {
        [*] --> LanguageSelection
        LanguageSelection --> LoadPhrases: Select language
        LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
        DisplayPhrases --> RecordingInterface: Show phrases & recording UI

        state RecordingInterface {
            [*] --> ShowInitialRows: Display first 10 phrases
            ShowInitialRows --> RecordAudio: User can record audio
            RecordAudio --> AddMoreRows: Optional - add 10 more rows
            AddMoreRows --> RecordAudio
        }
    }

    RecordingInterface --> DatasetCreation: User finishes recording

    state "Dataset Creation Options" as DatasetCreation {
        [*] --> FromRecordings: Create from recorded audio
        [*] --> FromUploads: Upload existing files

        FromRecordings --> ProcessRecordings: Save WAV files + transcripts
        FromUploads --> ProcessUploads: Process uploaded files + transcripts

        ProcessRecordings --> CreateJSONL: Generate JSONL dataset
        ProcessUploads --> CreateJSONL

        CreateJSONL --> DatasetReady: Dataset saved locally
    }

    DatasetCreation --> TrainingConfiguration: Dataset ready

    state "Training Setup" as TrainingConfiguration {
        [*] --> BasicSettings: Model, LoRA/full, batch size
        [*] --> AdvancedSettings: Learning rate, epochs, LoRA params

        BasicSettings --> ConfigureDeployment: Repo name, push options
        AdvancedSettings --> ConfigureDeployment

        ConfigureDeployment --> StartTraining: All settings configured
    }

    TrainingConfiguration --> TrainingProcess: Start training

    state "Training Process" as TrainingProcess {
        [*] --> InitializeTrackio: Setup experiment tracking
        InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
        RunTrainingScript --> StreamLogs: Show real-time training logs
        StreamLogs --> MonitorProgress: Track metrics & checkpoints

        MonitorProgress --> TrainingComplete: Training finished
        MonitorProgress --> HandleErrors: Training failed
        HandleErrors --> RetryOrExit: User can retry or exit
    }

    TrainingProcess --> PostTraining: Training complete

    state "Post-Training Actions" as PostTraining {
        [*] --> PushToHub: Push model to HF Hub
        [*] --> GenerateModelCard: Create model card
        [*] --> DeployDemoSpace: Deploy interactive demo

        PushToHub --> ModelPublished: Model available on HF Hub
        GenerateModelCard --> ModelDocumented: Model card created
        DeployDemoSpace --> DemoReady: Demo space deployed
    }

    PostTraining --> [*]: Process complete

    %% Alternative paths
    DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
    PushDatasetOnly --> DatasetPublished: Dataset on HF Hub

    %% Error handling
    TrainingProcess --> ErrorRecovery: Handle training errors
    ErrorRecovery --> RetryTraining: Retry with different settings
    RetryTraining --> TrainingConfiguration

    %% Styling and notes
    note right of LanguageSelection : User selects language for\n authentic phrases from\n NVIDIA Granary dataset
    note right of RecordingInterface : Users record themselves\n reading displayed phrases
    note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
    note right of TrainingConfiguration : Configure LoRA parameters,\n learning rate, epochs, etc.
    note right of TrainingProcess : Real-time log streaming\n with Trackio integration
    note right of PostTraining : Automated deployment\n pipeline
```

Steps you’ll follow in the UI:

- **Choose language**: Select a language for authentic phrases (from NVIDIA Granary)
- **Record or upload**: Capture your voice or provide existing audio + transcripts
- **Create dataset**: The app writes a JSONL file with entries like `{ "audio_path": ..., "text": ... }` (see the sketch after this list)
- **Configure training**: Pick base model, LoRA vs full, batch size and learning rate
- **Run training**: Watch live logs and metrics; resume on error if needed
- **Publish & deploy**: Push to HF Hub and one‑click deploy an interactive Space

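For reference, a dataset row is just an audio path and a transcript. Here is a minimal sketch of writing such a file by hand; the file names and `recordings/` folder are illustrative, not the app's actual layout:

```python
import json
from pathlib import Path

# Hypothetical rows; the app generates these from your recordings or uploads.
rows = [
    {"audio_path": "recordings/sample_000.wav", "text": "The quick brown fox jumps over the lazy dog."},
    {"audio_path": "recordings/sample_001.wav", "text": "Please transcribe my voice accurately."},
]

out = Path("data.jsonl")
with out.open("w", encoding="utf-8") as f:
    for row in rows:
        # One JSON object per line: {"audio_path": "...", "text": "..."}
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
print(f"Wrote {len(rows)} rows to {out}")
```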
## Train your personalized Voxtral model

Under the hood, training uses the Hugging Face Trainer and a custom `VoxtralDataCollator` that builds Voxtral/LLaMA‑style prompts and masks the prompt tokens so loss is computed only on the transcription.

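To make the masking idea concrete, here is a simplified sketch of label masking; the real `VoxtralDataCollator` also builds the full prompt with audio features, but the `-100` convention below is what tells the Trainer to ignore prompt tokens in the loss:

```python
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Simplified illustration: copy token ids and hide the prompt from the loss.

    Hugging Face models ignore positions whose label is -100, so only the
    transcription tokens (after `prompt_len`) contribute to the training loss.
    """
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100  # prompt tokens do not contribute to the loss
    return labels

# Toy example: a batch of one sequence where the first 4 tokens are the prompt.
ids = torch.tensor([[101, 102, 103, 104, 7, 8, 9]])
print(mask_prompt_labels(ids, prompt_len=4))
# tensor([[-100, -100, -100, -100,    7,    8,    9]])
```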
```mermaid
graph TB
    %% Input Data Sources
    subgraph "Data Sources"
        JSONL[JSONL Dataset<br/>{"audio_path": "...", "text": "..."}]
        GRANARY[NVIDIA Granary Dataset<br/>Multilingual ASR Data]
        HFDATA[HF Hub Datasets<br/>Community Datasets]
    end

    %% Data Processing
    subgraph "Data Processing"
        LOADER[Dataset Loader<br/>_load_jsonl_dataset()]
        CASTER[Audio Casting<br/>16kHz resampling]
        COLLATOR[VoxtralDataCollator<br/>Audio + Text Processing]
    end

    %% Training Scripts
    subgraph "Training Scripts"
        TRAIN_FULL[Full Fine-tuning<br/>scripts/train.py]
        TRAIN_LORA[LoRA Fine-tuning<br/>scripts/train_lora.py]

        subgraph "Training Components"
            MODEL_INIT[Model Initialization<br/>VoxtralForConditionalGeneration]
            LORA_CONFIG[LoRA Configuration<br/>LoraConfig + get_peft_model]
            PROCESSOR_INIT[Processor Initialization<br/>VoxtralProcessor]
        end
    end

    %% Training Infrastructure
    subgraph "Training Infrastructure"
        TRACKIO_INIT[Trackio Integration<br/>Experiment Tracking]
        HF_TRAINER[Hugging Face Trainer<br/>TrainingArguments + Trainer]
        TORCH_DEVICE[Torch Device Setup<br/>GPU/CPU Detection]
    end

    %% Training Process
    subgraph "Training Process"
        FORWARD_PASS[Forward Pass<br/>Audio Processing + Generation]
        LOSS_CALC[Loss Calculation<br/>Masked Language Modeling]
        BACKWARD_PASS[Backward Pass<br/>Gradient Computation]
        OPTIMIZER_STEP[Optimizer Step<br/>Parameter Updates]
        LOGGING[Metrics Logging<br/>Loss, Perplexity, etc.]
    end

    %% Model Management
    subgraph "Model Management"
        CHECKPOINT_SAVING[Checkpoint Saving<br/>Model snapshots]
        MODEL_SAVING[Final Model Saving<br/>Processor + Model]
        LOCAL_STORAGE[Local Storage<br/>outputs/ directory]
    end

    %% Flow Connections
    JSONL --> LOADER
    GRANARY --> LOADER
    HFDATA --> LOADER

    LOADER --> CASTER
    CASTER --> COLLATOR

    COLLATOR --> TRAIN_FULL
    COLLATOR --> TRAIN_LORA

    TRAIN_FULL --> MODEL_INIT
    TRAIN_LORA --> MODEL_INIT
    TRAIN_LORA --> LORA_CONFIG

    MODEL_INIT --> PROCESSOR_INIT
    LORA_CONFIG --> PROCESSOR_INIT

    PROCESSOR_INIT --> TRACKIO_INIT
    PROCESSOR_INIT --> HF_TRAINER
    PROCESSOR_INIT --> TORCH_DEVICE

    TRACKIO_INIT --> HF_TRAINER
    TORCH_DEVICE --> HF_TRAINER

    HF_TRAINER --> FORWARD_PASS
    FORWARD_PASS --> LOSS_CALC
    LOSS_CALC --> BACKWARD_PASS
    BACKWARD_PASS --> OPTIMIZER_STEP
    OPTIMIZER_STEP --> LOGGING

    LOGGING --> CHECKPOINT_SAVING
    LOGGING --> TRACKIO_INIT

    HF_TRAINER --> MODEL_SAVING
    MODEL_SAVING --> LOCAL_STORAGE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class JSONL,GRANARY,HFDATA input
    class LOADER,CASTER,COLLATOR processing
    class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
    class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
    class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
    class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
```

CLI alternatives (if you prefer the terminal):

```bash
# Full fine-tuning
uv run train.py

# Parameter‑efficient LoRA fine‑tuning (recommended for most users)
uv run train_lora.py
```

## Publish and deploy a live demo

After training, the app can push your model and metrics to the Hugging Face Hub and create an interactive Space demo automatically.

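For readers curious what the publishing step boils down to, here is a minimal hand-written sketch using `huggingface_hub`; the repo id and output directory are placeholders, and the actual `scripts/push_to_huggingface.py` additionally validates the artifacts, generates a model card, and records Trackio metadata:

```python
from huggingface_hub import HfApi

api = HfApi()  # uses your HF_TOKEN / `huggingface-cli login` credentials

repo_id = "your-username/voxtral-finetuned"   # placeholder repo id
output_dir = "outputs/voxtral-finetuned"      # placeholder local training output

# Create (or reuse) the model repository, then upload the trained files.
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(folder_path=output_dir, repo_id=repo_id, repo_type="model")
print(f"Pushed {output_dir} to https://huggingface.co/{repo_id}")
```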
```mermaid
graph TB
    %% Input Sources
    subgraph "Inputs"
        TRAINED_MODEL[Trained Model<br/>Local directory]
        TRAINING_CONFIG[Training Config<br/>JSON/YAML]
        TRAINING_RESULTS[Training Results<br/>Metrics & logs]
        MODEL_METADATA[Model Metadata<br/>Name, description, etc.]
    end

    %% Model Publishing
    subgraph "Model Publishing"
        PUSH_SCRIPT[push_to_huggingface.py<br/>Model Publisher]

        subgraph "Publishing Steps"
            REPO_CREATION[Repository Creation<br/>HF Hub API]
            FILE_UPLOAD[File Upload<br/>Model files to HF]
            METADATA_UPLOAD[Metadata Upload<br/>Config & results]
        end
    end

    %% Model Card Generation
    subgraph "Model Card Generation"
        CARD_SCRIPT[generate_model_card.py<br/>Card Generator]

        subgraph "Card Components"
            TEMPLATE_LOAD[Template Loading<br/>model_card.md]
            VARIABLE_REPLACEMENT[Variable Replacement<br/>Config injection]
            CONDITIONAL_PROCESSING[Conditional Sections<br/>Quantized models, etc.]
        end
    end

    %% Demo Space Deployment
    subgraph "Demo Space Deployment"
        DEPLOY_SCRIPT[deploy_demo_space.py<br/>Space Deployer]

        subgraph "Space Setup"
            SPACE_CREATION[Space Repository<br/>Create HF Space]
            TEMPLATE_COPY[Template Copying<br/>demo_voxtral/ files]
            ENV_INJECTION[Environment Setup<br/>Model config injection]
            SECRET_SETUP[Secret Configuration<br/>HF_TOKEN, model vars]
        end
    end

    %% Space Building & Testing
    subgraph "Space Building"
        BUILD_TRIGGER[Build Trigger<br/>Automatic build start]
        DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
        MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
        APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
    end

    %% Live Demo
    subgraph "Live Demo Space"
        GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
        MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
        USER_INTERACTION[User Interaction<br/>Audio upload/playback]
    end

    %% External Services
    subgraph "External Services"
        HF_HUB[Hugging Face Hub<br/>Model & Space hosting]
        HF_SPACES[HF Spaces Platform<br/>Demo hosting]
    end

    %% Flow Connections
    TRAINED_MODEL --> PUSH_SCRIPT
    TRAINING_CONFIG --> PUSH_SCRIPT
    TRAINING_RESULTS --> PUSH_SCRIPT
    MODEL_METADATA --> PUSH_SCRIPT

    PUSH_SCRIPT --> REPO_CREATION
    REPO_CREATION --> FILE_UPLOAD
    FILE_UPLOAD --> METADATA_UPLOAD

    METADATA_UPLOAD --> CARD_SCRIPT
    TRAINING_CONFIG --> CARD_SCRIPT
    TRAINING_RESULTS --> CARD_SCRIPT

    CARD_SCRIPT --> TEMPLATE_LOAD
    TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
    VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING

    CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
    METADATA_UPLOAD --> DEPLOY_SCRIPT

    DEPLOY_SCRIPT --> SPACE_CREATION
    SPACE_CREATION --> TEMPLATE_COPY
    TEMPLATE_COPY --> ENV_INJECTION
    ENV_INJECTION --> SECRET_SETUP

    SECRET_SETUP --> BUILD_TRIGGER
    BUILD_TRIGGER --> DEPENDENCY_INSTALL
    DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
    MODEL_DOWNLOAD --> APP_INITIALIZATION

    APP_INITIALIZATION --> GRADIO_INTERFACE
    GRADIO_INTERFACE --> MODEL_INFERENCE
    MODEL_INFERENCE --> USER_INTERACTION

    HF_HUB --> MODEL_DOWNLOAD
    HF_SPACES --> GRADIO_INTERFACE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
    class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
    class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
    class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
    class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
    class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
    class HF_HUB,HF_SPACES external
```

## Why personalization improves accessibility

- **Your model learns your patterns**: tempo, prosody, phoneme realizations, disfluencies
- **Vocabulary and names**: teach domain terms and proper nouns you use often
- **Bias correction**: reduce systematic errors common to off‑the‑shelf ASR for your voice
- **Agency and privacy**: keep data local and only publish when you choose

## Practical tips

- **Start with LoRA**: Parameter‑efficient fine‑tuning is faster and uses less memory (see the sketch after this list)
- **Record diverse samples**: Different tempos, environments, and phrase lengths
- **Short sessions**: Many shorter clips beat a few long ones for learning
- **Check transcripts**: Clean, accurate transcripts improve outcomes

+
## Learn more
|
485 |
+
|
486 |
+
- [Repository README](../README.md)
|
487 |
+
- [Documentation Overview](README.md)
|
488 |
+
- [Architecture Overview](architecture.md)
|
489 |
+
- [Interface Workflow](interface-workflow.md)
|
490 |
+
- [Training Pipeline](training-pipeline.md)
|
491 |
+
- [Deployment Pipeline](deployment-pipeline.md)
|
492 |
+
- [Data Flow](data-flow.md)
|
493 |
+
- [Interactive Diagrams](diagrams.html)
|
494 |
+
|
495 |
+
---
|
496 |
+
|
497 |
+
This project exists to make voice technology work better for everyone. If you build a model that helps you — or your community — consider sharing a demo so others can learn from it.
|
interface.py
CHANGED

The upload and recording handlers now return the saved JSONL path (stored in `jsonl_path_state`) alongside a status message, and the push handlers read that path from state instead of a module-level variable.

@@ -745,15 +745,15 @@ with gr.Blocks(title="Voxtral ASR Fine-tuning") as demo:

```python
    def _collect_upload(files, txt):
        lines = [s.strip() for s in (txt or "").splitlines() if s.strip()]
        jsonl_path = _save_uploaded_dataset(files or [], lines)
        return str(jsonl_path), f"✅ Dataset saved locally: {jsonl_path}"

    def _push_dataset_handler(repo_name, current_jsonl_path):
        if not current_jsonl_path:
            return "❌ No dataset saved yet. Please save dataset first."
        return _push_dataset_to_hub(current_jsonl_path, repo_name)

    save_upload_btn.click(_collect_upload, [upload_audio, transcripts_box], [jsonl_path_state, dataset_status])
    push_dataset_btn.click(_push_dataset_handler, [dataset_repo_name, jsonl_path_state], [dataset_status])

    # Save recordings button
    save_rec_btn = gr.Button("Save recordings as dataset", visible=False)
```

@@ -782,16 +782,16 @@ with gr.Blocks(title="Voxtral ASR Fine-tuning") as demo:

```python
            rows.append({"audio_path": str(out_path), "text": label_text})
        jsonl_path = dataset_dir / "data.jsonl"
        _write_jsonl(rows, jsonl_path)
        return str(jsonl_path), f"✅ Dataset saved locally: {jsonl_path}"

    save_rec_btn.click(_collect_preloaded_recs, rec_components + [phrase_texts_state], [jsonl_path_state, dataset_status])

    def _push_recordings_handler(repo_name, current_jsonl_path):
        if not current_jsonl_path:
            return "❌ No recordings dataset saved yet. Please save recordings first."
        return _push_dataset_to_hub(current_jsonl_path, repo_name)

    push_recordings_btn.click(_push_recordings_handler, [dataset_repo_name, jsonl_path_state], [dataset_status])

    # Removed multilingual dataset sample section - phrases are now loaded automatically when language is selected
```
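The pattern above, threading a saved path through `gr.State` rather than a global, can be illustrated with a tiny self-contained Gradio sketch; the component names and placeholder path here are illustrative only:

```python
import gradio as gr

def save_dataset(text):
    path = "datasets/demo/data.jsonl"  # placeholder path for illustration
    return path, f"✅ Dataset saved locally: {path}"

def push_dataset(repo_name, current_path):
    if not current_path:
        return "❌ No dataset saved yet. Please save dataset first."
    return f"Would push {current_path} to {repo_name}"

with gr.Blocks() as demo:
    jsonl_path_state = gr.State(value="")      # holds the last saved JSONL path
    transcripts = gr.Textbox(label="Transcripts")
    status = gr.Textbox(label="Status")
    repo = gr.Textbox(label="Dataset repo name")

    save_btn = gr.Button("Save dataset")
    push_btn = gr.Button("Push to Hub")

    # The save handler writes both the state and the status box;
    # the push handler reads the state as an extra input.
    save_btn.click(save_dataset, [transcripts], [jsonl_path_state, status])
    push_btn.click(push_dataset, [repo, jsonl_path_state], [status])

if __name__ == "__main__":
    demo.launch()
```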
scripts/deploy_demo_space.py
CHANGED

The stale `from config import SmolLM3Config` import is dropped, the Voxtral demo type no longer injects environment setup into app.py (Space variables are set via the API instead), README generation is skipped for Voxtral so the template README is kept, and `voxtral` becomes a valid `--demo-type`.

@@ -25,11 +25,9 @@

```python
    HF_HUB_AVAILABLE = False
    print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")

# Add src to path for imports (kept for potential future imports)
sys.path.append(str(Path(__file__).parent.parent / "src"))

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
```

@@ -223,14 +221,9 @@

```python
        """
        elif self.demo_type == "voxtral":
            # For Voxtral, we do not inject env setup into app.py.
            # Space variables are set via the API in set_space_secrets().
            env_setup = ""
        else:
            # For SmolLM models, use simpler setup
            import json as _json
```

@@ -534,80 +527,80 @@

````python
            copied_files.append(file_path.name)
            logger.info(f"✅ Copied {file_path.name} to temp directory")

        # Update app.py with environment variables (skip for Voxtral)
        app_file = Path(temp_dir) / "app.py"
        if app_file.exists() and self.demo_type != "voxtral":
            with open(app_file, 'r', encoding='utf-8') as f:
                content = f.read()

            env_setup = self._generate_env_setup()

            if env_setup:
                # Insert after imports
                lines = content.split('\n')
                import_end = 0
                for i, line in enumerate(lines):
                    if line.startswith('import ') or line.startswith('from '):
                        import_end = i + 1
                    elif line.strip() == '' and import_end > 0:
                        break

                lines.insert(import_end, env_setup)
                content = '\n'.join(lines)

                with open(app_file, 'w', encoding='utf-8') as f:
                    f.write(content)

                logger.info("✅ Updated app.py with model configuration")

        # For Voxtral keep the template README. For others, create a README with YAML front matter.
        if self.demo_type != "voxtral":
            # YAML front matter required by Hugging Face Spaces
            yaml_front_matter = (
                f"---\n"
                f"title: {'GPT-OSS Demo' if self.demo_type == 'gpt' else 'SmolLM3 Demo'}\n"
                f"emoji: {'🌟' if self.demo_type == 'gpt' else '💃🏻'}\n"
                f"colorFrom: {'blue' if self.demo_type == 'gpt' else 'green'}\n"
                f"colorTo: {'pink' if self.demo_type == 'gpt' else 'purple'}\n"
                f"sdk: gradio\n"
                f"sdk_version: 5.40.0\n"
                f"app_file: app.py\n"
                f"pinned: false\n"
                f"short_description: Interactive demo for {self.model_id}\n"
                + ("license: mit\n" if self.demo_type != 'gpt' else "") +
                f"---\n\n"
            )

            readme_content = (
                yaml_front_matter
                + f"# Demo: {self.model_id}\n\n"
                + f"This is an interactive demo for the fine-tuned model {self.model_id}.\n\n"
                + "## Features\n"
                  "- Interactive chat interface\n"
                  "- Customizable system & developer prompts\n"
                  "- Advanced generation parameters\n"
                  "- Thinking mode support\n\n"
                + "## Model Information\n"
                  f"- **Model ID**: {self.model_id}\n"
                  f"- **Subfolder**: {self.subfolder if self.subfolder and self.subfolder.strip() else 'main'}\n"
                  f"- **Deployed by**: {self.hf_username}\n"
                + ("- **Base Model**: openai/gpt-oss-20b\n" if self.demo_type == 'gpt' else "")
                + "\n"
                + "## Configuration\n"
                  "- **Model Identity**:\n\n"
                  f"```\n{self.model_identity or 'Not set'}\n```\n\n"
                  "- **System Message** (default):\n\n"
                  f"```\n{(self.system_message or self.model_identity) or 'Not set'}\n```\n\n"
                  "- **Developer Message** (default):\n\n"
                  f"```\n{self.developer_message or 'Not set'}\n```\n\n"
                  "These defaults come from the selected training configuration and can be adjusted in the UI when you run the demo.\n\n"
                + "## Usage\n"
                  "Simply start chatting with the model using the interface below!\n\n"
                + "---\n"
                  "*This demo was automatically deployed by the SmolFactory Fine-tuning Pipeline*\n"
            )

            with open(Path(temp_dir) / "README.md", 'w', encoding='utf-8') as f:
                f.write(readme_content)

        logger.info(f"✅ Prepared {len(copied_files)} files in temporary directory")
        return temp_dir
````

@@ -874,7 +867,7 @@

```python
    parser.add_argument("--model-id", required=True, help="Model ID to deploy demo for")
    parser.add_argument("--subfolder", default="int4", help="Model subfolder (default: int4)")
    parser.add_argument("--space-name", help="Custom space name (optional)")
    parser.add_argument("--demo-type", choices=["smol", "gpt", "voxtral"], help="Demo type: 'smol' for SmolLM, 'gpt' for GPT-OSS, 'voxtral' for Voxtral ASR (auto-detected if not specified)")
    parser.add_argument("--config-file", help="Path to the training config file to import context (system/developer/model_identity)")
    # Examples configuration
    parser.add_argument("--examples-type", choices=["general", "medical"], help="Examples pack to enable in the demo UI")
```
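Since the Voxtral path now relies on Space variables rather than code patched into app.py, here is a hedged sketch of how such variables can be set through the Hub API; the Space id and values are placeholders, and the script's own `set_space_secrets()` remains the authoritative implementation:

```python
from huggingface_hub import HfApi

api = HfApi()
space_id = "your-username/voxtral-demo"  # placeholder Space id

# Public configuration the demo app reads at startup.
api.add_space_variable(space_id, "HF_MODEL_ID", "your-username/voxtral-finetuned")
api.add_space_variable(space_id, "MODEL_NAME", "voxtral-finetuned")
api.add_space_variable(space_id, "HF_USERNAME", "your-username")

# Secret token so the Space can pull a private model; stored encrypted.
api.add_space_secret(space_id, "HF_TOKEN", "hf_xxx")  # placeholder token
```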
scripts/push_to_huggingface.py
CHANGED

The pusher now records an `artifact_type` (full model vs LoRA adapter), validates the expected files for each case before uploading, and generates a Voxtral-specific ASR model card instead of the old SmolLM3 text-generation card.

@@ -69,6 +69,8 @@ class HuggingFacePusher:

```python
        # Resolve the full repo id (username/repo) if user only provided repo name
        self.repo_id = self._resolve_repo_id(self.repo_name)
        # Artifact type detection (full vs lora)
        self.artifact_type: Optional[str] = None

        logger.info(f"Initialized HuggingFacePusher for {self.repo_id}")
```

@@ -133,37 +135,57 @@ class HuggingFacePusher:

```python
            logger.error(f"❌ Failed to create repository: {e}")
            return False

    def _detect_artifact_type(self) -> str:
        """Detect whether output dir contains a full model or a LoRA adapter."""
        # LoRA artifacts
        lora_candidates = [
            self.model_path / "adapter_config.json",
            self.model_path / "adapter_model.safetensors",
            self.model_path / "adapter_model.bin",
        ]
        if any(p.exists() for p in lora_candidates) and (self.model_path / "adapter_config.json").exists():
            return "lora"

        # Full model artifacts
        full_candidates = [
            self.model_path / "config.json",
            self.model_path / "model.safetensors",
            self.model_path / "model.safetensors.index.json",
            self.model_path / "pytorch_model.bin",
        ]
        if any(p.exists() for p in full_candidates):
            return "full"

        return "unknown"

    def validate_model_path(self) -> bool:
        """Validate that the model path contains required files for Voxtral full or LoRA."""
        self.artifact_type = self._detect_artifact_type()
        if self.artifact_type == "lora":
            required = [self.model_path / "adapter_config.json"]
            if not all(p.exists() for p in required):
                logger.error("❌ LoRA artifacts missing required files (adapter_config.json)")
                return False
            # At least one adapter weight
            if not ((self.model_path / "adapter_model.safetensors").exists() or (self.model_path / "adapter_model.bin").exists()):
                logger.error("❌ LoRA artifacts missing adapter weights (adapter_model.safetensors or adapter_model.bin)")
                return False
            logger.info("✅ Detected LoRA adapter artifacts")
            return True

        if self.artifact_type == "full":
            # Relaxed set: require config.json and at least one model weights file
            if not (self.model_path / "config.json").exists():
                logger.error("❌ Missing config.json in model directory")
                return False
            if not ((self.model_path / "model.safetensors").exists() or (self.model_path / "model.safetensors.index.json").exists() or (self.model_path / "pytorch_model.bin").exists()):
                logger.error("❌ Missing model weights file (model.safetensors or pytorch_model.bin)")
                return False
            logger.info("✅ Detected full model artifacts")
            return True

        logger.error("❌ Could not detect model artifacts (neither full model nor LoRA)")
        return False

    def create_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
        """Create a comprehensive model card using the generate_model_card.py script"""
```

@@ -215,88 +237,48 @@ class HuggingFacePusher:

````python
        return self._create_simple_model_card(training_config, results)

    def _create_simple_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
        """Create a simple model card tailored for Voxtral ASR (supports full and LoRA)."""
        tags = ["voxtral", "asr", "speech-to-text", "fine-tuning"]
        if self.artifact_type == "lora":
            tags.append("lora")
        front_matter = {
            "license": "apache-2.0",
            "tags": tags,
            "pipeline_tag": "automatic-speech-recognition",
        }
        fm_yaml = "---\n" + "\n".join([
            "license: apache-2.0",
            "tags:",
        ]) + "\n" + "\n".join([f"- {t}" for t in tags]) + "\n" + "pipeline_tag: automatic-speech-recognition\n---\n\n"
        model_title = self.repo_id.split('/')[-1]
        body = [
            f"# {model_title}",
            "",
            ("This repository contains a LoRA adapter for Voxtral ASR. "
             "Merge the adapter with the base model or load via PEFT for inference." if self.artifact_type == "lora" else
             "This repository contains a fine-tuned Voxtral ASR model."),
            "",
            "## Usage",
            "",
            (f"```python\nfrom transformers import AutoProcessor\nfrom peft import PeftModel\nfrom transformers import AutoModelForSeq2SeqLM\n\nbase_model_id = 'mistralai/Voxtral-Mini-3B-2507'\nprocessor = AutoProcessor.from_pretrained(base_model_id)\nbase_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)\nmodel = PeftModel.from_pretrained(base_model, '{self.repo_id}')\n```" if self.artifact_type == "lora" else
             f"""```python
from transformers import AutoProcessor, AutoModelForSeq2SeqLM

processor = AutoProcessor.from_pretrained("{self.repo_id}")
model = AutoModelForSeq2SeqLM.from_pretrained("{self.repo_id}")
```"""),
            "",
            "## Training Configuration",
            "",
            f"```json\n{json.dumps(training_config or {}, indent=2)}\n```",
            "",
            "## Training Results",
            "",
            f"```json\n{json.dumps(results or {}, indent=2)}\n```",
            "",
            f"**Hardware**: {self._get_hardware_info()}",
        ]
        return fm_yaml + "\n".join(body)

    def _get_model_size(self) -> float:
        """Get model size in GB"""
````
templates/model_card.md
CHANGED

The template's tag list now leads with `voxtral` and drops the `causal-lm` and `legml` tags.

@@ -5,12 +5,10 @@ language:

```yaml
license: apache-2.0
library_name: transformers
tags:
- voxtral
- fine-tuned
- text-generation
- tonic
{{#if quantized_models}}- quantized{{/if}}
pipeline_tag: text-generation
base_model: {{base_model}}
```
templates/spaces/demo_voxtral/README.md
CHANGED
@@ -12,12 +12,24 @@ short_description: Interactive ASR demo for a fine-tuned Voxtral model

```markdown
This Space serves a Voxtral ASR model for speech-to-text transcription.
Usage:

- Select a language (or leave on Auto for detection).
- Upload an audio file or record via microphone.
- Click Transcribe to see the transcription.
- Works best with standard speech audio; Voxtral handles language detection by default.

Environment variables expected:

- `HF_MODEL_ID`: The model repo to load (e.g., `username/voxtral-finetune-YYYYMMDD_HHMMSS`)
- `MODEL_NAME`: Display name
- `HF_USERNAME`: For branding
- `MODEL_SUBFOLDER`: Optional subfolder in the repo (e.g., `int4`) for quantized/packed weights

Supported languages:

- English, French, German, Spanish, Italian, Portuguese, Dutch, Hindi
- Or choose Auto to let the model detect the language

Notes:

- Uses bfloat16 on GPU and float32 on CPU.
- Decodes only newly generated tokens for clean transcriptions.
```
templates/spaces/demo_voxtral/app.py
CHANGED

The demo app now loads the Voxtral model class explicitly (with a fallback for older transformers), honors an optional `MODEL_SUBFOLDER`, picks bfloat16 on GPU and float32 on CPU, adds a language dropdown with Auto detection, and decodes only the newly generated tokens.

@@ -1,33 +1,100 @@

```python
import os
import gradio as gr
import torch
from transformers import AutoProcessor
try:
    from transformers import VoxtralForConditionalGeneration as VoxtralModelClass
except Exception:
    # Fallback for older transformers versions
    from transformers import AutoModelForSeq2SeqLM as VoxtralModelClass

HF_MODEL_ID = os.getenv("HF_MODEL_ID", "mistralai/Voxtral-Mini-3B-2507")
MODEL_NAME = os.getenv("MODEL_NAME", HF_MODEL_ID.split("/")[-1])
HF_USERNAME = os.getenv("HF_USERNAME", "")
MODEL_SUBFOLDER = os.getenv("MODEL_SUBFOLDER", "").strip()

try:
    processor = AutoProcessor.from_pretrained(HF_MODEL_ID)
except Exception:
    # Fallback: some repos may store processor files inside the subfolder
    if MODEL_SUBFOLDER:
        processor = AutoProcessor.from_pretrained(HF_MODEL_ID, subfolder=MODEL_SUBFOLDER)
    else:
        raise

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use float32 on CPU; bfloat16 on CUDA if available
if torch.cuda.is_available():
    model_kwargs = {"device_map": "auto", "torch_dtype": torch.bfloat16}
else:
    model_kwargs = {"torch_dtype": torch.float32}

if MODEL_SUBFOLDER:
    model = VoxtralModelClass.from_pretrained(
        HF_MODEL_ID, subfolder=MODEL_SUBFOLDER, **model_kwargs
    )
else:
    model = VoxtralModelClass.from_pretrained(
        HF_MODEL_ID, **model_kwargs
    )

# Simple language options (with Auto detection)
LANGUAGES = {
    "Auto": "auto",
    "English": "en",
    "French": "fr",
    "German": "de",
    "Spanish": "es",
    "Italian": "it",
    "Portuguese": "pt",
    "Dutch": "nl",
    "Hindi": "hi",
}

MAX_NEW_TOKENS = 1024

def transcribe(sel_language, audio_path):
    if audio_path is None:
        return "No audio provided"
    language_code = LANGUAGES.get(sel_language, "auto")
    # Build Voxtral transcription inputs from filepath and selected language
    if hasattr(processor, "apply_transcrition_request"):
        inputs = processor.apply_transcrition_request(
            language=language_code,
            audio=audio_path,
            model_id=HF_MODEL_ID,
        )
    else:
        # Compatibility with potential corrected naming
        inputs = processor.apply_transcription_request(
            language=language_code,
            audio=audio_path,
            model_id=HF_MODEL_ID,
        )
    # Move to device with appropriate dtype
    inputs = inputs.to(device, dtype=(torch.bfloat16 if device == "cuda" else torch.float32))
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
    # Decode only newly generated tokens (beyond the prompt length)
    decoded = processor.batch_decode(
        output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    return decoded[0]

with gr.Blocks() as demo:
    gr.Markdown(f"# 🎙️ Voxtral ASR Demo — {MODEL_NAME}")
    with gr.Row():
        language = gr.Dropdown(
            choices=list(LANGUAGES.keys()), value="Auto", label="Language"
        )
        audio = gr.Audio(
            sources=["upload", "microphone"],
            type="filepath",
            label="Upload or record audio",
        )
    btn = gr.Button("Transcribe")
    out = gr.Textbox(label="Transcription", lines=8)
    btn.click(transcribe, inputs=[language, audio], outputs=[out])

if __name__ == "__main__":
    demo.launch(mcp_server=True, ssr_mode=False)
```
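Once the Space is live, it can also be called programmatically. A hedged sketch using `gradio_client`; the Space id is a placeholder and the endpoint name depends on how Gradio registers the click handler, so check the Space's "Use via API" page:

```python
from gradio_client import Client, handle_file

client = Client("your-username/voxtral-demo")  # placeholder Space id
result = client.predict(
    "Auto",                     # language dropdown value
    handle_file("sample.wav"),  # local audio file to transcribe
    api_name="/transcribe",     # assumed endpoint name; verify on the Space's API page
)
print(result)
```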