Joseph Pollack committed on
Commit
a595d5a
·
unverified ·
1 Parent(s): a3a3978

improves demo for automatic deployment and links the interface to the deployment scripts

docs/blog-accessibility.md ADDED
@@ -0,0 +1,497 @@
1
+ # Accessible Speech Recognition: Fine‑tune Voxtral on Your Own Voice
2
+
3
+ Building speech technology that understands everyone is an accessibility imperative. If you have a speech impediment (e.g., stutter, dysarthria, apraxia) or a heavy accent, mainstream ASR systems can struggle. This app lets you fine‑tune the Voxtral ASR model on your own voice so it adapts to your unique speaking style — improving recognition accuracy and unlocking more inclusive voice experiences.
4
+
5
+ ## Who this helps
6
+
7
+ - **People with speech differences**: Personalized models that reduce error rates on your voice
8
+ - **Accented speakers**: Adapt Voxtral to your accent and vocabulary
9
+ - **Educators/clinicians**: Create tailored recognition models for communication support
10
+ - **Product teams**: Prototype inclusive voice features with real users quickly
11
+
12
+ ## What you get
13
+
14
+ - **Record or upload audio** and create a JSONL dataset in a few clicks
15
+ - **One‑click training** with full fine‑tuning or LoRA for efficiency
16
+ - **Automatic publishing** to Hugging Face Hub with a generated model card
17
+ - **Instant demo deployment** to HF Spaces for shareable, live ASR
18
+
19
+ ## How it works (at a glance)
20
+
21
+ ```mermaid
22
+ graph TD
23
+ %% Main Entry Point
24
+ START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}
25
+
26
+ %% Documentation Categories
27
+ OVERVIEW --> ARCH[🏗️ Architecture Overview]
28
+ OVERVIEW --> WORKFLOW[🔄 Interface Workflow]
29
+ OVERVIEW --> TRAINING[🚀 Training Pipeline]
30
+ OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
31
+ OVERVIEW --> DATAFLOW[📊 Data Flow]
32
+
33
+ %% Architecture Section
34
+ ARCH --> ARCH_DIAG[High-level Architecture<br/>System Components & Layers]
35
+ ARCH --> ARCH_LINK["📄 View Details → architecture.md"]
36
+
37
+ %% Interface Section
38
+ WORKFLOW --> WORKFLOW_DIAG[User Journey<br/>Recording → Training → Demo]
39
+ WORKFLOW --> WORKFLOW_LINK["📄 View Details → interface-workflow.md"]
40
+
41
+ %% Training Section
42
+ TRAINING --> TRAINING_DIAG[Training Scripts<br/>Data → Model → Results]
43
+ TRAINING --> TRAINING_LINK["📄 View Details → training-pipeline.md"]
44
+
45
+ %% Deployment Section
46
+ DEPLOYMENT --> DEPLOYMENT_DIAG[Publishing & Demo<br/>Model → Hub → Space]
47
+ DEPLOYMENT --> DEPLOYMENT_LINK["📄 View Details → deployment-pipeline.md"]
48
+
49
+ %% Data Flow Section
50
+ DATAFLOW --> DATAFLOW_DIAG[Complete Data Journey<br/>Input → Processing → Output]
51
+ DATAFLOW --> DATAFLOW_LINK["📄 View Details → data-flow.md"]
52
+
53
+ %% Key Components Highlight
54
+ subgraph "🎛️ Core Components"
55
+ INTERFACE[interface.py<br/>Gradio Web UI]
56
+ TRAIN_SCRIPTS[scripts/train*.py<br/>Training Scripts]
57
+ DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
58
+ PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
59
+ end
60
+
61
+ %% Data Flow Highlight
62
+ subgraph "📁 Key Data Formats"
63
+ JSONL[JSONL Dataset<br/>{"audio_path": "...", "text": "..."}]
64
+ HFDATA[HF Hub Models<br/>username/model-name]
65
+ SPACES[HF Spaces<br/>Interactive Demos]
66
+ end
67
+
68
+ %% Connect components to their respective docs
69
+ INTERFACE --> WORKFLOW
70
+ TRAIN_SCRIPTS --> TRAINING
71
+ DEPLOY_SCRIPT --> DEPLOYMENT
72
+ PUSH_SCRIPT --> DEPLOYMENT
73
+
74
+ JSONL --> DATAFLOW
75
+ HFDATA --> DEPLOYMENT
76
+ SPACES --> DEPLOYMENT
77
+
78
+ %% Styling
79
+ classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
80
+ classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
81
+ classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
82
+ classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
83
+ classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
84
+ classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
85
+
86
+ class START entry
87
+ class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
88
+ class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
89
+ class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
90
+ class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
91
+ class JSONL,HFDATA,SPACES data
92
+ ```
93
+
94
+ See the interactive diagram page for printing and quick navigation: [Interactive diagrams](diagrams.html).
95
+
96
+ ## Quick start
97
+
98
+ ### 1) Install
99
+
100
+ ```bash
101
+ git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
102
+ cd Finetune-Voxtral-ASR
103
+ ```
104
+
105
+ Use UV (recommended) or pip.
106
+
107
+ ```bash
108
+ # UV
109
+ uv venv .venv --python 3.10 && source .venv/bin/activate
110
+ uv pip install -r requirements.txt
111
+
112
+ # or pip
113
+ python -m venv .venv --python 3.10 && source .venv/bin/activate
114
+ pip install --upgrade pip
115
+ pip install -r requirements.txt
116
+ ```
117
+
118
+ ### 2) Launch the interface
119
+
120
+ ```bash
121
+ python interface.py
122
+ ```
123
+
124
+ The Gradio app guides you through language selection, recording or uploading audio, dataset creation, and training.
125
+
126
+ ## Create your voice dataset (UI)
127
+
128
+ ```mermaid
129
+ stateDiagram-v2
130
+ [*] --> LanguageSelection: User opens interface
131
+
132
+ state "Language & Dataset Setup" as LangSetup {
133
+ [*] --> LanguageSelection
134
+ LanguageSelection --> LoadPhrases: Select language
135
+ LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
136
+ DisplayPhrases --> RecordingInterface: Show phrases & recording UI
137
+
138
+ state RecordingInterface {
139
+ [*] --> ShowInitialRows: Display first 10 phrases
140
+ ShowInitialRows --> RecordAudio: User can record audio
141
+ RecordAudio --> AddMoreRows: Optional - add 10 more rows
142
+ AddMoreRows --> RecordAudio
143
+ }
144
+ }
145
+
146
+ RecordingInterface --> DatasetCreation: User finishes recording
147
+
148
+ state "Dataset Creation Options" as DatasetCreation {
149
+ [*] --> FromRecordings: Create from recorded audio
150
+ [*] --> FromUploads: Upload existing files
151
+
152
+ FromRecordings --> ProcessRecordings: Save WAV files + transcripts
153
+ FromUploads --> ProcessUploads: Process uploaded files + transcripts
154
+
155
+ ProcessRecordings --> CreateJSONL: Generate JSONL dataset
156
+ ProcessUploads --> CreateJSONL
157
+
158
+ CreateJSONL --> DatasetReady: Dataset saved locally
159
+ }
160
+
161
+ DatasetCreation --> TrainingConfiguration: Dataset ready
162
+
163
+ state "Training Setup" as TrainingConfiguration {
164
+ [*] --> BasicSettings: Model, LoRA/full, batch size
165
+ [*] --> AdvancedSettings: Learning rate, epochs, LoRA params
166
+
167
+ BasicSettings --> ConfigureDeployment: Repo name, push options
168
+ AdvancedSettings --> ConfigureDeployment
169
+
170
+ ConfigureDeployment --> StartTraining: All settings configured
171
+ }
172
+
173
+ TrainingConfiguration --> TrainingProcess: Start training
174
+
175
+ state "Training Process" as TrainingProcess {
176
+ [*] --> InitializeTrackio: Setup experiment tracking
177
+ InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
178
+ RunTrainingScript --> StreamLogs: Show real-time training logs
179
+ StreamLogs --> MonitorProgress: Track metrics & checkpoints
180
+
181
+ MonitorProgress --> TrainingComplete: Training finished
182
+ MonitorProgress --> HandleErrors: Training failed
183
+ HandleErrors --> RetryOrExit: User can retry or exit
184
+ }
185
+
186
+ TrainingProcess --> PostTraining: Training complete
187
+
188
+ state "Post-Training Actions" as PostTraining {
189
+ [*] --> PushToHub: Push model to HF Hub
190
+ [*] --> GenerateModelCard: Create model card
191
+ [*] --> DeployDemoSpace: Deploy interactive demo
192
+
193
+ PushToHub --> ModelPublished: Model available on HF Hub
194
+ GenerateModelCard --> ModelDocumented: Model card created
195
+ DeployDemoSpace --> DemoReady: Demo space deployed
196
+ }
197
+
198
+ PostTraining --> [*]: Process complete
199
+
200
+ %% Alternative paths
201
+ DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
202
+ PushDatasetOnly --> DatasetPublished: Dataset on HF Hub
203
+
204
+ %% Error handling
205
+ TrainingProcess --> ErrorRecovery: Handle training errors
206
+ ErrorRecovery --> RetryTraining: Retry with different settings
207
+ RetryTraining --> TrainingConfiguration
208
+
209
+ %% Styling and notes
210
+ note right of LanguageSelection : User selects language for\n authentic phrases from\n NVIDIA Granary dataset
211
+ note right of RecordingInterface : Users record themselves\n reading displayed phrases
212
+ note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
213
+ note right of TrainingConfiguration : Configure LoRA parameters,\n learning rate, epochs, etc.
214
+ note right of TrainingProcess : Real-time log streaming\n with Trackio integration
215
+ note right of PostTraining : Automated deployment\n pipeline
216
+ ```
217
+
218
+ Steps you’ll follow in the UI:
219
+
220
+ - **Choose language**: Select a language for authentic phrases (from NVIDIA Granary)
221
+ - **Record or upload**: Capture your voice or provide existing audio + transcripts
222
+ - **Create dataset**: The app writes a JSONL file with entries like `{ "audio_path": ..., "text": ... }` (see the example after this list)
223
+ - **Configure training**: Pick base model, LoRA vs full, batch size and learning rate
224
+ - **Run training**: Watch live logs and metrics; resume on error if needed
225
+ - **Publish & deploy**: Push to HF Hub and one‑click deploy an interactive Space
226
+
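+ A minimal example of the resulting JSONL (one JSON object per line; the file names are placeholders for the clips the app saves):
+
+ ```json
+ {"audio_path": "datasets/my_voice/rec_0001.wav", "text": "The quick brown fox jumps over the lazy dog."}
+ {"audio_path": "datasets/my_voice/rec_0002.wav", "text": "Please schedule my appointment for Tuesday morning."}
+ ```
+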
227
+ ## Train your personalized Voxtral model
228
+
229
+ Under the hood, training uses Hugging Face Trainer and a custom `VoxtralDataCollator` that builds Voxtral/LLaMA‑style prompts and masks the prompt tokens so loss is computed only on the transcription.
230
+
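+ To make the masking concrete, here is a conceptual sketch (this is not the project's collator — that lives in the training scripts): labels start as a copy of the input ids, and every prompt/audio position is set to `-100` so the cross‑entropy loss ignores it.
+
+ ```python
+ import torch
+
+ def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
+     """Toy stand-in for the collator's label masking."""
+     labels = input_ids.clone()
+     labels[:, :prompt_len] = -100  # -100 is ignored by the loss
+     return labels
+
+ # Toy batch: 4 prompt/audio tokens followed by 3 transcription tokens
+ batch = torch.tensor([[101, 7, 8, 9, 1500, 1501, 1502]])
+ print(mask_prompt_tokens(batch, prompt_len=4))
+ # tensor([[-100, -100, -100, -100, 1500, 1501, 1502]])
+ ```
+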
231
+ ```mermaid
232
+ graph TB
233
+ %% Input Data Sources
234
+ subgraph "Data Sources"
235
+ JSONL[JSONL Dataset<br/>{"audio_path": "...", "text": "..."}]
236
+ GRANARY[NVIDIA Granary Dataset<br/>Multilingual ASR Data]
237
+ HFDATA[HF Hub Datasets<br/>Community Datasets]
238
+ end
239
+
240
+ %% Data Processing
241
+ subgraph "Data Processing"
242
+ LOADER[Dataset Loader<br/>_load_jsonl_dataset()]
243
+ CASTER[Audio Casting<br/>16kHz resampling]
244
+ COLLATOR[VoxtralDataCollator<br/>Audio + Text Processing]
245
+ end
246
+
247
+ %% Training Scripts
248
+ subgraph "Training Scripts"
249
+ TRAIN_FULL[Full Fine-tuning<br/>scripts/train.py]
250
+ TRAIN_LORA[LoRA Fine-tuning<br/>scripts/train_lora.py]
251
+
252
+ subgraph "Training Components"
253
+ MODEL_INIT[Model Initialization<br/>VoxtralForConditionalGeneration]
254
+ LORA_CONFIG[LoRA Configuration<br/>LoraConfig + get_peft_model]
255
+ PROCESSOR_INIT[Processor Initialization<br/>VoxtralProcessor]
256
+ end
257
+ end
258
+
259
+ %% Training Infrastructure
260
+ subgraph "Training Infrastructure"
261
+ TRACKIO_INIT[Trackio Integration<br/>Experiment Tracking]
262
+ HF_TRAINER[Hugging Face Trainer<br/>TrainingArguments + Trainer]
263
+ TORCH_DEVICE[Torch Device Setup<br/>GPU/CPU Detection]
264
+ end
265
+
266
+ %% Training Process
267
+ subgraph "Training Process"
268
+ FORWARD_PASS[Forward Pass<br/>Audio Processing + Generation]
269
+ LOSS_CALC[Loss Calculation<br/>Masked Language Modeling]
270
+ BACKWARD_PASS[Backward Pass<br/>Gradient Computation]
271
+ OPTIMIZER_STEP[Optimizer Step<br/>Parameter Updates]
272
+ LOGGING[Metrics Logging<br/>Loss, Perplexity, etc.]
273
+ end
274
+
275
+ %% Model Management
276
+ subgraph "Model Management"
277
+ CHECKPOINT_SAVING[Checkpoint Saving<br/>Model snapshots]
278
+ MODEL_SAVING[Final Model Saving<br/>Processor + Model]
279
+ LOCAL_STORAGE[Local Storage<br/>outputs/ directory]
280
+ end
281
+
282
+ %% Flow Connections
283
+ JSONL --> LOADER
284
+ GRANARY --> LOADER
285
+ HFDATA --> LOADER
286
+
287
+ LOADER --> CASTER
288
+ CASTER --> COLLATOR
289
+
290
+ COLLATOR --> TRAIN_FULL
291
+ COLLATOR --> TRAIN_LORA
292
+
293
+ TRAIN_FULL --> MODEL_INIT
294
+ TRAIN_LORA --> MODEL_INIT
295
+ TRAIN_LORA --> LORA_CONFIG
296
+
297
+ MODEL_INIT --> PROCESSOR_INIT
298
+ LORA_CONFIG --> PROCESSOR_INIT
299
+
300
+ PROCESSOR_INIT --> TRACKIO_INIT
301
+ PROCESSOR_INIT --> HF_TRAINER
302
+ PROCESSOR_INIT --> TORCH_DEVICE
303
+
304
+ TRACKIO_INIT --> HF_TRAINER
305
+ TORCH_DEVICE --> HF_TRAINER
306
+
307
+ HF_TRAINER --> FORWARD_PASS
308
+ FORWARD_PASS --> LOSS_CALC
309
+ LOSS_CALC --> BACKWARD_PASS
310
+ BACKWARD_PASS --> OPTIMIZER_STEP
311
+ OPTIMIZER_STEP --> LOGGING
312
+
313
+ LOGGING --> CHECKPOINT_SAVING
314
+ LOGGING --> TRACKIO_INIT
315
+
316
+ HF_TRAINER --> MODEL_SAVING
317
+ MODEL_SAVING --> LOCAL_STORAGE
318
+
319
+ %% Styling
320
+ classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
321
+ classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
322
+ classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
323
+ classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
324
+ classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
325
+ classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px
326
+
327
+ class JSONL,GRANARY,HFDATA input
328
+ class LOADER,CASTER,COLLATOR processing
329
+ class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
330
+ class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
331
+ class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
332
+ class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
333
+ ```
334
+
335
+ CLI alternatives (if you prefer the terminal):
336
+
337
+ ```bash
338
+ # Full fine-tuning
339
+ uv run train.py
340
+
341
+ # Parameter‑efficient LoRA fine‑tuning (recommended for most users)
342
+ uv run train_lora.py
343
+ ```
344
+
345
+ ## Publish and deploy a live demo
346
+
347
+ After training, the app can push your model and metrics to the Hugging Face Hub and create an interactive Space demo automatically.
348
+
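+ Both steps are also scriptable from the terminal. A hedged example of deploying the demo Space (the flags below come from `scripts/deploy_demo_space.py`; the model id is a placeholder for your own pushed repo):
+
+ ```bash
+ # Push the model first (via the UI or scripts/push_to_huggingface.py), then deploy the Space
+ python scripts/deploy_demo_space.py \
+     --model-id your-username/voxtral-finetune-20250101_120000 \
+     --demo-type voxtral
+ ```
+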
349
+ ```mermaid
350
+ graph TB
351
+ %% Input Sources
352
+ subgraph "Inputs"
353
+ TRAINED_MODEL[Trained Model<br/>Local directory]
354
+ TRAINING_CONFIG[Training Config<br/>JSON/YAML]
355
+ TRAINING_RESULTS[Training Results<br/>Metrics & logs]
356
+ MODEL_METADATA[Model Metadata<br/>Name, description, etc.]
357
+ end
358
+
359
+ %% Model Publishing
360
+ subgraph "Model Publishing"
361
+ PUSH_SCRIPT[push_to_huggingface.py<br/>Model Publisher]
362
+
363
+ subgraph "Publishing Steps"
364
+ REPO_CREATION[Repository Creation<br/>HF Hub API]
365
+ FILE_UPLOAD[File Upload<br/>Model files to HF]
366
+ METADATA_UPLOAD[Metadata Upload<br/>Config & results]
367
+ end
368
+ end
369
+
370
+ %% Model Card Generation
371
+ subgraph "Model Card Generation"
372
+ CARD_SCRIPT[generate_model_card.py<br/>Card Generator]
373
+
374
+ subgraph "Card Components"
375
+ TEMPLATE_LOAD[Template Loading<br/>model_card.md]
376
+ VARIABLE_REPLACEMENT[Variable Replacement<br/>Config injection]
377
+ CONDITIONAL_PROCESSING[Conditional Sections<br/>Quantized models, etc.]
378
+ end
379
+ end
380
+
381
+ %% Demo Space Deployment
382
+ subgraph "Demo Space Deployment"
383
+ DEPLOY_SCRIPT[deploy_demo_space.py<br/>Space Deployer]
384
+
385
+ subgraph "Space Setup"
386
+ SPACE_CREATION[Space Repository<br/>Create HF Space]
387
+ TEMPLATE_COPY[Template Copying<br/>demo_voxtral/ files]
388
+ ENV_INJECTION[Environment Setup<br/>Model config injection]
389
+ SECRET_SETUP[Secret Configuration<br/>HF_TOKEN, model vars]
390
+ end
391
+ end
392
+
393
+ %% Space Building & Testing
394
+ subgraph "Space Building"
395
+ BUILD_TRIGGER[Build Trigger<br/>Automatic build start]
396
+ DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
397
+ MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
398
+ APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
399
+ end
400
+
401
+ %% Live Demo
402
+ subgraph "Live Demo Space"
403
+ GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
404
+ MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
405
+ USER_INTERACTION[User Interaction<br/>Audio upload/playback]
406
+ end
407
+
408
+ %% External Services
409
+ subgraph "External Services"
410
+ HF_HUB[Hugging Face Hub<br/>Model & Space hosting]
411
+ HF_SPACES[HF Spaces Platform<br/>Demo hosting]
412
+ end
413
+
414
+ %% Flow Connections
415
+ TRAINED_MODEL --> PUSH_SCRIPT
416
+ TRAINING_CONFIG --> PUSH_SCRIPT
417
+ TRAINING_RESULTS --> PUSH_SCRIPT
418
+ MODEL_METADATA --> PUSH_SCRIPT
419
+
420
+ PUSH_SCRIPT --> REPO_CREATION
421
+ REPO_CREATION --> FILE_UPLOAD
422
+ FILE_UPLOAD --> METADATA_UPLOAD
423
+
424
+ METADATA_UPLOAD --> CARD_SCRIPT
425
+ TRAINING_CONFIG --> CARD_SCRIPT
426
+ TRAINING_RESULTS --> CARD_SCRIPT
427
+
428
+ CARD_SCRIPT --> TEMPLATE_LOAD
429
+ TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
430
+ VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING
431
+
432
+ CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
433
+ METADATA_UPLOAD --> DEPLOY_SCRIPT
434
+
435
+ DEPLOY_SCRIPT --> SPACE_CREATION
436
+ SPACE_CREATION --> TEMPLATE_COPY
437
+ TEMPLATE_COPY --> ENV_INJECTION
438
+ ENV_INJECTION --> SECRET_SETUP
439
+
440
+ SECRET_SETUP --> BUILD_TRIGGER
441
+ BUILD_TRIGGER --> DEPENDENCY_INSTALL
442
+ DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
443
+ MODEL_DOWNLOAD --> APP_INITIALIZATION
444
+
445
+ APP_INITIALIZATION --> GRADIO_INTERFACE
446
+ GRADIO_INTERFACE --> MODEL_INFERENCE
447
+ MODEL_INFERENCE --> USER_INTERACTION
448
+
449
+ HF_HUB --> MODEL_DOWNLOAD
450
+ HF_SPACES --> GRADIO_INTERFACE
451
+
452
+ %% Styling
453
+ classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
454
+ classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
455
+ classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
456
+ classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
457
+ classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
458
+ classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
459
+ classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
460
+
461
+ class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
462
+ class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
463
+ class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
464
+ class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
465
+ class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
466
+ class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
467
+ class HF_HUB,HF_SPACES external
468
+ ```
469
+
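+ Once the model is on the Hub, the deployed Space loads it and transcribes audio roughly as follows — a condensed sketch of `templates/spaces/demo_voxtral/app.py`, with the repo id as a placeholder for your own model:
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, VoxtralForConditionalGeneration
+
+ repo_id = "your-username/voxtral-finetune-20250101_120000"  # placeholder
+ processor = AutoProcessor.from_pretrained(repo_id)
+ model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.float32)
+
+ # Build a transcription request from an audio file (method name spelled as in current transformers)
+ inputs = processor.apply_transcrition_request(language="en", audio="sample.wav", model_id=repo_id)
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=256)
+ # Decode only the newly generated tokens
+ text = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
+ print(text)
+ ```
+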
470
+ ## Why personalization improves accessibility
471
+
472
+ - **Your model learns your patterns**: tempo, prosody, phoneme realizations, disfluencies
473
+ - **Vocabulary and names**: teach domain terms and proper nouns you use often
474
+ - **Bias correction**: reduce systematic errors common to off‑the‑shelf ASR for your voice
475
+ - **Agency and privacy**: keep data local and only publish when you choose
476
+
477
+ ## Practical tips
478
+
479
+ - **Start with LoRA**: Parameter‑efficient fine‑tuning is faster and uses less memory (see the sketch after this list)
480
+ - **Record diverse samples**: Different tempos, environments, and phrase lengths
481
+ - **Short sessions**: Many shorter clips beat a few long ones for learning
482
+ - **Check transcripts**: Clean, accurate transcripts improve outcomes
483
+
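+ For context, a minimal sketch of what the LoRA option configures with PEFT — the rank, alpha, and target module names here are illustrative assumptions; `scripts/train_lora.py` defines the values the app actually uses:
+
+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import VoxtralForConditionalGeneration
+
+ model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
+ lora_config = LoraConfig(
+     r=16,                                 # adapter rank (assumed)
+     lora_alpha=32,                        # scaling factor (assumed)
+     target_modules=["q_proj", "v_proj"],  # assumed attention projections
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora_config)
+ model.print_trainable_parameters()  # only the small adapter matrices are trainable
+ ```
+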
484
+ ## Learn more
485
+
486
+ - [Repository README](../README.md)
487
+ - [Documentation Overview](README.md)
488
+ - [Architecture Overview](architecture.md)
489
+ - [Interface Workflow](interface-workflow.md)
490
+ - [Training Pipeline](training-pipeline.md)
491
+ - [Deployment Pipeline](deployment-pipeline.md)
492
+ - [Data Flow](data-flow.md)
493
+ - [Interactive Diagrams](diagrams.html)
494
+
495
+ ---
496
+
497
+ This project exists to make voice technology work better for everyone. If you build a model that helps you — or your community — consider sharing a demo so others can learn from it.
interface.py CHANGED
@@ -745,15 +745,15 @@ with gr.Blocks(title="Voxtral ASR Fine-tuning") as demo:
745
  def _collect_upload(files, txt):
746
  lines = [s.strip() for s in (txt or "").splitlines() if s.strip()]
747
  jsonl_path = _save_uploaded_dataset(files or [], lines)
748
- return f"✅ Dataset saved locally: {jsonl_path}"
749
 
750
- def _push_dataset_handler(repo_name):
751
- if not jsonl_path_state.value:
752
  return "❌ No dataset saved yet. Please save dataset first."
753
- return _push_dataset_to_hub(jsonl_path_state.value, repo_name)
754
 
755
- save_upload_btn.click(_collect_upload, [upload_audio, transcripts_box], [jsonl_path_state])
756
- push_dataset_btn.click(_push_dataset_handler, [dataset_repo_name], [jsonl_path_state])
757
 
758
  # Save recordings button
759
  save_rec_btn = gr.Button("Save recordings as dataset", visible=False)
@@ -782,16 +782,16 @@ with gr.Blocks(title="Voxtral ASR Fine-tuning") as demo:
782
  rows.append({"audio_path": str(out_path), "text": label_text})
783
  jsonl_path = dataset_dir / "data.jsonl"
784
  _write_jsonl(rows, jsonl_path)
785
- return str(jsonl_path)
786
 
787
- save_rec_btn.click(_collect_preloaded_recs, rec_components + [phrase_texts_state], [jsonl_path_state])
788
 
789
- def _push_recordings_handler(repo_name):
790
- if not jsonl_path_state.value:
791
  return "❌ No recordings dataset saved yet. Please save recordings first."
792
- return _push_dataset_to_hub(jsonl_path_state.value, repo_name)
793
 
794
- push_recordings_btn.click(_push_recordings_handler, [dataset_repo_name], [jsonl_path_state])
795
 
796
  # Removed multilingual dataset sample section - phrases are now loaded automatically when language is selected
797
 
 
745
  def _collect_upload(files, txt):
746
  lines = [s.strip() for s in (txt or "").splitlines() if s.strip()]
747
  jsonl_path = _save_uploaded_dataset(files or [], lines)
748
+ return str(jsonl_path), f"✅ Dataset saved locally: {jsonl_path}"
749
 
750
+ def _push_dataset_handler(repo_name, current_jsonl_path):
751
+ if not current_jsonl_path:
752
  return "❌ No dataset saved yet. Please save dataset first."
753
+ return _push_dataset_to_hub(current_jsonl_path, repo_name)
754
 
755
+ save_upload_btn.click(_collect_upload, [upload_audio, transcripts_box], [jsonl_path_state, dataset_status])
756
+ push_dataset_btn.click(_push_dataset_handler, [dataset_repo_name, jsonl_path_state], [dataset_status])
757
 
758
  # Save recordings button
759
  save_rec_btn = gr.Button("Save recordings as dataset", visible=False)
 
782
  rows.append({"audio_path": str(out_path), "text": label_text})
783
  jsonl_path = dataset_dir / "data.jsonl"
784
  _write_jsonl(rows, jsonl_path)
785
+ return str(jsonl_path), f"✅ Dataset saved locally: {jsonl_path}"
786
 
787
+ save_rec_btn.click(_collect_preloaded_recs, rec_components + [phrase_texts_state], [jsonl_path_state, dataset_status])
788
 
789
+ def _push_recordings_handler(repo_name, current_jsonl_path):
790
+ if not current_jsonl_path:
791
  return "❌ No recordings dataset saved yet. Please save recordings first."
792
+ return _push_dataset_to_hub(current_jsonl_path, repo_name)
793
 
794
+ push_recordings_btn.click(_push_recordings_handler, [dataset_repo_name, jsonl_path_state], [dataset_status])
795
 
796
  # Removed multilingual dataset sample section - phrases are now loaded automatically when language is selected
797
 
scripts/deploy_demo_space.py CHANGED
@@ -25,11 +25,9 @@ except ImportError:
25
  HF_HUB_AVAILABLE = False
26
  print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")
27
 
28
- # Add src to path for imports
29
  sys.path.append(str(Path(__file__).parent.parent / "src"))
30
 
31
- from config import SmolLM3Config
32
-
33
  # Setup logging
34
  logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
35
  logger = logging.getLogger(__name__)
@@ -223,14 +221,9 @@ os.environ['BRAND_PROJECT_URL'] = {_json.dumps(self.brand_project_url)}
223
 
224
  """
225
  elif self.demo_type == "voxtral":
226
- import json as _json
227
- env_setup = f"""
228
- # Environment variables for Voxtral ASR demo
229
- import os
230
- os.environ['HF_MODEL_ID'] = {_json.dumps(self.model_id)}
231
- os.environ['MODEL_NAME'] = {_json.dumps(self.model_id.split('/')[-1])}
232
- os.environ['HF_USERNAME'] = {_json.dumps(self.hf_username)}
233
- """
234
  else:
235
  # For SmolLM models, use simpler setup
236
  import json as _json
@@ -534,80 +527,80 @@ os.environ['BRAND_PROJECT_URL'] = {_json.dumps(self.brand_project_url)}
534
  copied_files.append(file_path.name)
535
  logger.info(f"✅ Copied {file_path.name} to temp directory")
536
 
537
- # Update app.py with environment variables
538
  app_file = Path(temp_dir) / "app.py"
539
- if app_file.exists():
540
  with open(app_file, 'r', encoding='utf-8') as f:
541
  content = f.read()
542
-
543
- # Add environment variable setup at the top
544
  env_setup = self._generate_env_setup()
545
-
546
- # Insert after imports
547
- lines = content.split('\n')
548
- import_end = 0
549
- for i, line in enumerate(lines):
550
- if line.startswith('import ') or line.startswith('from '):
551
- import_end = i + 1
552
- elif line.strip() == '' and import_end > 0:
553
- break
554
-
555
- lines.insert(import_end, env_setup)
556
- content = '\n'.join(lines)
557
-
558
- with open(app_file, 'w', encoding='utf-8') as f:
559
- f.write(content)
560
-
561
- logger.info("✅ Updated app.py with model configuration")
562
-
563
- # YAML front matter required by Hugging Face Spaces
564
- yaml_front_matter = (
565
- f"---\n"
566
- f"title: {'GPT-OSS Demo' if self.demo_type == 'gpt' else 'SmolLM3 Demo'}\n"
567
- f"emoji: {'🌟' if self.demo_type == 'gpt' else '💃🏻'}\n"
568
- f"colorFrom: {'blue' if self.demo_type == 'gpt' else 'green'}\n"
569
- f"colorTo: {'pink' if self.demo_type == 'gpt' else 'purple'}\n"
570
- f"sdk: gradio\n"
571
- f"sdk_version: 5.40.0\n"
572
- f"app_file: app.py\n"
573
- f"pinned: false\n"
574
- f"short_description: Interactive demo for {self.model_id}\n"
575
- + ("license: mit\n" if self.demo_type != 'gpt' else "") +
576
- f"---\n\n"
577
- )
578
 
579
- # Create README.md for the space (include configuration details)
580
- readme_content = (
581
- yaml_front_matter
582
- + f"# Demo: {self.model_id}\n\n"
583
- + f"This is an interactive demo for the fine-tuned model {self.model_id}.\n\n"
584
- + "## Features\n"
585
- "- Interactive chat interface\n"
586
- "- Customizable system & developer prompts\n"
587
- "- Advanced generation parameters\n"
588
- "- Thinking mode support\n\n"
589
- + "## Model Information\n"
590
- f"- **Model ID**: {self.model_id}\n"
591
- f"- **Subfolder**: {self.subfolder if self.subfolder and self.subfolder.strip() else 'main'}\n"
592
- f"- **Deployed by**: {self.hf_username}\n"
593
- + ("- **Base Model**: openai/gpt-oss-20b\n" if self.demo_type == 'gpt' else "")
594
- + "\n"
595
- + "## Configuration\n"
596
- "- **Model Identity**:\n\n"
597
- f"```\n{self.model_identity or 'Not set'}\n```\n\n"
598
- "- **System Message** (default):\n\n"
599
- f"```\n{(self.system_message or self.model_identity) or 'Not set'}\n```\n\n"
600
- "- **Developer Message** (default):\n\n"
601
- f"```\n{self.developer_message or 'Not set'}\n```\n\n"
602
- "These defaults come from the selected training configuration and can be adjusted in the UI when you run the demo.\n\n"
603
- + "## Usage\n"
604
- "Simply start chatting with the model using the interface below!\n\n"
605
- + "---\n"
606
- "*This demo was automatically deployed by the SmolFactory Fine-tuning Pipeline*\n"
607
- )
608
-
609
- with open(Path(temp_dir) / "README.md", 'w', encoding='utf-8') as f:
610
- f.write(readme_content)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
611
 
612
  logger.info(f"✅ Prepared {len(copied_files)} files in temporary directory")
613
  return temp_dir
@@ -874,7 +867,7 @@ def main():
874
  parser.add_argument("--model-id", required=True, help="Model ID to deploy demo for")
875
  parser.add_argument("--subfolder", default="int4", help="Model subfolder (default: int4)")
876
  parser.add_argument("--space-name", help="Custom space name (optional)")
877
- parser.add_argument("--demo-type", choices=["smol", "gpt"], help="Demo type: 'smol' for SmolLM, 'gpt' for GPT-OSS (auto-detected if not specified)")
878
  parser.add_argument("--config-file", help="Path to the training config file to import context (system/developer/model_identity)")
879
  # Examples configuration
880
  parser.add_argument("--examples-type", choices=["general", "medical"], help="Examples pack to enable in the demo UI")
 
25
  HF_HUB_AVAILABLE = False
26
  print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")
27
 
28
+ # Add src to path for imports (kept for potential future imports)
29
  sys.path.append(str(Path(__file__).parent.parent / "src"))
30
 
 
 
31
  # Setup logging
32
  logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
33
  logger = logging.getLogger(__name__)
 
221
 
222
  """
223
  elif self.demo_type == "voxtral":
224
+ # For Voxtral, we do not inject env setup into app.py.
225
+ # Space variables are set via the API in set_space_secrets().
226
+ env_setup = ""
 
 
 
 
 
227
  else:
228
  # For SmolLM models, use simpler setup
229
  import json as _json
 
527
  copied_files.append(file_path.name)
528
  logger.info(f"✅ Copied {file_path.name} to temp directory")
529
 
530
+ # Update app.py with environment variables (skip for Voxtral)
531
  app_file = Path(temp_dir) / "app.py"
532
+ if app_file.exists() and self.demo_type != "voxtral":
533
  with open(app_file, 'r', encoding='utf-8') as f:
534
  content = f.read()
535
+
 
536
  env_setup = self._generate_env_setup()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
537
 
538
+ if env_setup:
539
+ # Insert after imports
540
+ lines = content.split('\n')
541
+ import_end = 0
542
+ for i, line in enumerate(lines):
543
+ if line.startswith('import ') or line.startswith('from '):
544
+ import_end = i + 1
545
+ elif line.strip() == '' and import_end > 0:
546
+ break
547
+
548
+ lines.insert(import_end, env_setup)
549
+ content = '\n'.join(lines)
550
+
551
+ with open(app_file, 'w', encoding='utf-8') as f:
552
+ f.write(content)
553
+
554
+ logger.info(" Updated app.py with model configuration")
555
+
556
+ # For Voxtral keep the template README. For others, create a README with YAML front matter.
557
+ if self.demo_type != "voxtral":
558
+ yaml_front_matter = (
559
+ f"---\n"
560
+ f"title: {'GPT-OSS Demo' if self.demo_type == 'gpt' else 'SmolLM3 Demo'}\n"
561
+ f"emoji: {'🌟' if self.demo_type == 'gpt' else '💃🏻'}\n"
562
+ f"colorFrom: {'blue' if self.demo_type == 'gpt' else 'green'}\n"
563
+ f"colorTo: {'pink' if self.demo_type == 'gpt' else 'purple'}\n"
564
+ f"sdk: gradio\n"
565
+ f"sdk_version: 5.40.0\n"
566
+ f"app_file: app.py\n"
567
+ f"pinned: false\n"
568
+ f"short_description: Interactive demo for {self.model_id}\n"
569
+ + ("license: mit\n" if self.demo_type != 'gpt' else "") +
570
+ f"---\n\n"
571
+ )
572
+
573
+ readme_content = (
574
+ yaml_front_matter
575
+ + f"# Demo: {self.model_id}\n\n"
576
+ + f"This is an interactive demo for the fine-tuned model {self.model_id}.\n\n"
577
+ + "## Features\n"
578
+ "- Interactive chat interface\n"
579
+ "- Customizable system & developer prompts\n"
580
+ "- Advanced generation parameters\n"
581
+ "- Thinking mode support\n\n"
582
+ + "## Model Information\n"
583
+ f"- **Model ID**: {self.model_id}\n"
584
+ f"- **Subfolder**: {self.subfolder if self.subfolder and self.subfolder.strip() else 'main'}\n"
585
+ f"- **Deployed by**: {self.hf_username}\n"
586
+ + ("- **Base Model**: openai/gpt-oss-20b\n" if self.demo_type == 'gpt' else "")
587
+ + "\n"
588
+ + "## Configuration\n"
589
+ "- **Model Identity**:\n\n"
590
+ f"```\n{self.model_identity or 'Not set'}\n```\n\n"
591
+ "- **System Message** (default):\n\n"
592
+ f"```\n{(self.system_message or self.model_identity) or 'Not set'}\n```\n\n"
593
+ "- **Developer Message** (default):\n\n"
594
+ f"```\n{self.developer_message or 'Not set'}\n```\n\n"
595
+ "These defaults come from the selected training configuration and can be adjusted in the UI when you run the demo.\n\n"
596
+ + "## Usage\n"
597
+ "Simply start chatting with the model using the interface below!\n\n"
598
+ + "---\n"
599
+ "*This demo was automatically deployed by the SmolFactory Fine-tuning Pipeline*\n"
600
+ )
601
+
602
+ with open(Path(temp_dir) / "README.md", 'w', encoding='utf-8') as f:
603
+ f.write(readme_content)
604
 
605
  logger.info(f"✅ Prepared {len(copied_files)} files in temporary directory")
606
  return temp_dir
 
867
  parser.add_argument("--model-id", required=True, help="Model ID to deploy demo for")
868
  parser.add_argument("--subfolder", default="int4", help="Model subfolder (default: int4)")
869
  parser.add_argument("--space-name", help="Custom space name (optional)")
870
+ parser.add_argument("--demo-type", choices=["smol", "gpt", "voxtral"], help="Demo type: 'smol' for SmolLM, 'gpt' for GPT-OSS, 'voxtral' for Voxtral ASR (auto-detected if not specified)")
871
  parser.add_argument("--config-file", help="Path to the training config file to import context (system/developer/model_identity)")
872
  # Examples configuration
873
  parser.add_argument("--examples-type", choices=["general", "medical"], help="Examples pack to enable in the demo UI")
scripts/push_to_huggingface.py CHANGED
@@ -69,6 +69,8 @@ class HuggingFacePusher:
69
 
70
  # Resolve the full repo id (username/repo) if user only provided repo name
71
  self.repo_id = self._resolve_repo_id(self.repo_name)
 
 
72
 
73
  logger.info(f"Initialized HuggingFacePusher for {self.repo_id}")
74
 
@@ -133,37 +135,57 @@ class HuggingFacePusher:
133
  logger.error(f"❌ Failed to create repository: {e}")
134
  return False
135
 
136
- def validate_model_path(self) -> bool:
137
- """Validate that the model path contains required files"""
138
- # Support both safetensors and pytorch formats
139
- required_files = [
140
- "config.json",
141
- "tokenizer.json",
142
- "tokenizer_config.json"
143
  ]
144
-
145
- # Check for model files (either safetensors or pytorch)
146
- model_files = [
147
- "model.safetensors.index.json", # Safetensors format
148
- "pytorch_model.bin" # PyTorch format
 
 
 
 
149
  ]
150
-
151
- missing_files = []
152
- for file in required_files:
153
- if not (self.model_path / file).exists():
154
- missing_files.append(file)
155
-
156
- # Check if at least one model file exists
157
- model_file_exists = any((self.model_path / file).exists() for file in model_files)
158
- if not model_file_exists:
159
- missing_files.extend(model_files)
160
-
161
- if missing_files:
162
- logger.error(f"❌ Missing required files: {missing_files}")
163
- return False
164
-
165
- logger.info(" Model files validated")
166
- return True
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
167
 
168
  def create_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
169
  """Create a comprehensive model card using the generate_model_card.py script"""
@@ -215,88 +237,48 @@ class HuggingFacePusher:
215
  return self._create_simple_model_card(training_config, results)
216
 
217
  def _create_simple_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
218
- """Create a simple model card without complex YAML to avoid formatting issues"""
219
- return f"""---
220
- language:
221
- - en
222
- - fr
223
- license: apache-2.0
224
- tags:
225
- - smollm3
226
- - fine-tuned
227
- - causal-lm
228
- - text-generation
229
- pipeline_tag: text-generation
230
- base_model: HuggingFaceTB/SmolLM3-3B
231
- ---
232
-
233
- # {self.repo_id.split('/')[-1]}
234
-
235
- This is a fine-tuned SmolLM3 model based on the HuggingFaceTB/SmolLM3-3B architecture.
236
-
237
- ## Model Details
238
-
239
- - **Base Model**: HuggingFaceTB/SmolLM3-3B
240
- - **Fine-tuning Method**: Supervised Fine-tuning
241
- - **Training Date**: {datetime.now().strftime('%Y-%m-%d')}
242
- - **Model Size**: {self._get_model_size():.1f} GB
243
- - **Dataset Repository**: {self.dataset_repo}
244
- - **Hardware**: {self._get_hardware_info()}
245
-
246
- ## Training Configuration
247
-
248
- ```json
249
- {json.dumps(training_config, indent=2)}
250
- ```
251
-
252
- ## Training Results
253
-
254
- ```json
255
- {json.dumps(results, indent=2)}
256
- ```
257
-
258
- ## Usage
259
-
260
- ```python
261
- from transformers import AutoModelForCausalLM, AutoTokenizer
262
-
263
- # Load model and tokenizer
264
- model = AutoModelForCausalLM.from_pretrained("{self.repo_id}")
265
- tokenizer = AutoTokenizer.from_pretrained("{self.repo_id}")
266
-
267
- # Generate text
268
- inputs = tokenizer("Hello, how are you?", return_tensors="pt")
269
- outputs = model.generate(**inputs, max_new_tokens=100)
270
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
271
- ```
272
-
273
- ## Training Information
274
-
275
- - **Base Model**: HuggingFaceTB/SmolLM3-3B
276
- - **Hardware**: {self._get_hardware_info()}
277
- - **Training Time**: {results.get('training_time_hours', 'Unknown')} hours
278
- - **Final Loss**: {results.get('final_loss', 'Unknown')}
279
- - **Final Accuracy**: {results.get('final_accuracy', 'Unknown')}
280
- - **Dataset Repository**: {self.dataset_repo}
281
-
282
- ## Model Performance
283
-
284
- - **Training Loss**: {results.get('train_loss', 'Unknown')}
285
- - **Validation Loss**: {results.get('eval_loss', 'Unknown')}
286
- - **Training Steps**: {results.get('total_steps', 'Unknown')}
287
-
288
- ## Experiment Tracking
289
-
290
- This model was trained with experiment tracking enabled. Training metrics and configuration are stored in the HF Dataset repository: `{self.dataset_repo}`
291
-
292
- ## Limitations and Biases
293
-
294
- This model is fine-tuned for specific tasks and may not generalize well to all use cases. Please evaluate the model's performance on your specific task before deployment.
295
-
296
- ## License
297
-
298
- This model is licensed under the Apache 2.0 License.
299
- """
300
 
301
  def _get_model_size(self) -> float:
302
  """Get model size in GB"""
 
69
 
70
  # Resolve the full repo id (username/repo) if user only provided repo name
71
  self.repo_id = self._resolve_repo_id(self.repo_name)
72
+ # Artifact type detection (full vs lora)
73
+ self.artifact_type: Optional[str] = None
74
 
75
  logger.info(f"Initialized HuggingFacePusher for {self.repo_id}")
76
 
 
135
  logger.error(f"❌ Failed to create repository: {e}")
136
  return False
137
 
138
+ def _detect_artifact_type(self) -> str:
139
+ """Detect whether output dir contains a full model or a LoRA adapter."""
140
+ # LoRA artifacts
141
+ lora_candidates = [
142
+ self.model_path / "adapter_config.json",
143
+ self.model_path / "adapter_model.safetensors",
144
+ self.model_path / "adapter_model.bin",
145
  ]
146
+ if any(p.exists() for p in lora_candidates) and (self.model_path / "adapter_config.json").exists():
147
+ return "lora"
148
+
149
+ # Full model artifacts
150
+ full_candidates = [
151
+ self.model_path / "config.json",
152
+ self.model_path / "model.safetensors",
153
+ self.model_path / "model.safetensors.index.json",
154
+ self.model_path / "pytorch_model.bin",
155
  ]
156
+ if any(p.exists() for p in full_candidates):
157
+ return "full"
158
+
159
+ return "unknown"
160
+
161
+ def validate_model_path(self) -> bool:
162
+ """Validate that the model path contains required files for Voxtral full or LoRA."""
163
+ self.artifact_type = self._detect_artifact_type()
164
+ if self.artifact_type == "lora":
165
+ required = [self.model_path / "adapter_config.json"]
166
+ if not all(p.exists() for p in required):
167
+ logger.error("❌ LoRA artifacts missing required files (adapter_config.json)")
168
+ return False
169
+ # At least one adapter weight
170
+ if not ((self.model_path / "adapter_model.safetensors").exists() or (self.model_path / "adapter_model.bin").exists()):
171
+ logger.error(" LoRA artifacts missing adapter weights (adapter_model.safetensors or adapter_model.bin)")
172
+ return False
173
+ logger.info("✅ Detected LoRA adapter artifacts")
174
+ return True
175
+
176
+ if self.artifact_type == "full":
177
+ # Relaxed set: require config.json and at least one model weights file
178
+ if not (self.model_path / "config.json").exists():
179
+ logger.error("❌ Missing config.json in model directory")
180
+ return False
181
+ if not ((self.model_path / "model.safetensors").exists() or (self.model_path / "model.safetensors.index.json").exists() or (self.model_path / "pytorch_model.bin").exists()):
182
+ logger.error("❌ Missing model weights file (model.safetensors or pytorch_model.bin)")
183
+ return False
184
+ logger.info("✅ Detected full model artifacts")
185
+ return True
186
+
187
+ logger.error("❌ Could not detect model artifacts (neither full model nor LoRA)")
188
+ return False
189
 
190
  def create_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
191
  """Create a comprehensive model card using the generate_model_card.py script"""
 
237
  return self._create_simple_model_card(training_config, results)
238
 
239
  def _create_simple_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
240
+ """Create a simple model card tailored for Voxtral ASR (supports full and LoRA)."""
241
+ tags = ["voxtral", "asr", "speech-to-text", "fine-tuning"]
242
+ if self.artifact_type == "lora":
243
+ tags.append("lora")
244
+ front_matter = {
245
+ "license": "apache-2.0",
246
+ "tags": tags,
247
+ "pipeline_tag": "automatic-speech-recognition",
248
+ }
249
+ fm_yaml = "---\n" + "\n".join([
250
+ "license: apache-2.0",
251
+ "tags:",
252
+ ]) + "\n" + "\n".join([f"- {t}" for t in tags]) + "\n" + "pipeline_tag: automatic-speech-recognition\n---\n\n"
253
+ model_title = self.repo_id.split('/')[-1]
254
+ body = [
255
+ f"# {model_title}",
256
+ "",
257
+ ("This repository contains a LoRA adapter for Voxtral ASR. "
258
+ "Merge the adapter with the base model or load via PEFT for inference." if self.artifact_type == "lora" else
259
+ "This repository contains a fine-tuned Voxtral ASR model."),
260
+ "",
261
+ "## Usage",
262
+ "",
263
+ ("```python\nfrom transformers import AutoProcessor\nfrom peft import PeftModel\nfrom transformers import AutoModelForSeq2SeqLM\n\nbase_model_id = 'mistralai/Voxtral-Mini-3B-2507'\nprocessor = AutoProcessor.from_pretrained(base_model_id)\nbase_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)\nmodel = PeftModel.from_pretrained(base_model, '{self.repo_id}')\n```" if self.artifact_type == "lora" else
264
+ f"""```python
265
+ from transformers import AutoProcessor, AutoModelForSeq2SeqLM
266
+
267
+ processor = AutoProcessor.from_pretrained("{self.repo_id}")
268
+ model = AutoModelForSeq2SeqLM.from_pretrained("{self.repo_id}")
269
+ ```"""),
270
+ "",
271
+ "## Training Configuration",
272
+ "",
273
+ f"```json\n{json.dumps(training_config or {}, indent=2)}\n```",
274
+ "",
275
+ "## Training Results",
276
+ "",
277
+ f"```json\n{json.dumps(results or {}, indent=2)}\n```",
278
+ "",
279
+ f"**Hardware**: {self._get_hardware_info()}",
280
+ ]
281
+ return fm_yaml + "\n".join(body)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
282
 
283
  def _get_model_size(self) -> float:
284
  """Get model size in GB"""
templates/model_card.md CHANGED
@@ -5,12 +5,10 @@ language:
5
  license: apache-2.0
6
  library_name: transformers
7
  tags:
8
- - smollm3
9
  - fine-tuned
10
- - causal-lm
11
  - text-generation
12
  - tonic
13
- - legml
14
  {{#if quantized_models}}- quantized{{/if}}
15
  pipeline_tag: text-generation
16
  base_model: {{base_model}}
 
5
  license: apache-2.0
6
  library_name: transformers
7
  tags:
8
+ - voxtral
9
  - fine-tuned
 
10
  - text-generation
11
  - tonic
 
12
  {{#if quantized_models}}- quantized{{/if}}
13
  pipeline_tag: text-generation
14
  base_model: {{base_model}}
templates/spaces/demo_voxtral/README.md CHANGED
@@ -12,12 +12,24 @@ short_description: Interactive ASR demo for a fine-tuned Voxtral model
12
  This Space serves a Voxtral ASR model for speech-to-text transcription.
13
  Usage:
14
 
15
- - Click Record and read the displayed phrase aloud.
16
- - Stop recording to see the transcription.
17
- - Works best with ~16 kHz audio; internal processing follows Voxtral's processor expectations.
 
18
 
19
  Environment variables expected:
20
 
21
  - `HF_MODEL_ID`: The model repo to load (e.g., `username/voxtral-finetune-YYYYMMDD_HHMMSS`)
22
  - `MODEL_NAME`: Display name
23
  - `HF_USERNAME`: For branding
 
 
 
 
 
 
 
 
 
 
 
 
12
  This Space serves a Voxtral ASR model for speech-to-text transcription.
13
  Usage:
14
 
15
+ - Select a language (or leave on Auto for detection).
16
+ - Upload an audio file or record via microphone.
17
+ - Click Transcribe to see the transcription.
18
+ - Works best with standard speech audio; Voxtral handles language detection by default.
19
 
20
  Environment variables expected:
21
 
22
  - `HF_MODEL_ID`: The model repo to load (e.g., `username/voxtral-finetune-YYYYMMDD_HHMMSS`)
23
  - `MODEL_NAME`: Display name
24
  - `HF_USERNAME`: For branding
25
+ - `MODEL_SUBFOLDER`: Optional subfolder in the repo (e.g., `int4`) for quantized/packed weights
26
+
27
+ Supported languages:
28
+
29
+ - English, French, German, Spanish, Italian, Portuguese, Dutch, Hindi
30
+ - Or choose Auto to let the model detect the language
31
+
32
+ Notes:
33
+
34
+ - Uses bfloat16 on GPU and float32 on CPU.
35
+ - Decodes only newly generated tokens for clean transcriptions.
templates/spaces/demo_voxtral/app.py CHANGED
@@ -1,33 +1,100 @@
1
  import os
2
  import gradio as gr
3
  import torch
4
- from transformers import AutoProcessor, AutoModelForSeq2SeqLM
 
 
 
 
 
5
 
6
  HF_MODEL_ID = os.getenv("HF_MODEL_ID", "mistralai/Voxtral-Mini-3B-2507")
7
  MODEL_NAME = os.getenv("MODEL_NAME", HF_MODEL_ID.split("/")[-1])
8
  HF_USERNAME = os.getenv("HF_USERNAME", "")
 
9
 
10
- processor = AutoProcessor.from_pretrained(HF_MODEL_ID)
11
- model = AutoModelForSeq2SeqLM.from_pretrained(HF_MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16)
 
 
 
 
 
 
12
 
13
- def transcribe(audio_tuple):
14
- if audio_tuple is None:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
15
  return "No audio provided"
16
- sr, data = audio_tuple
17
- inputs = processor.apply_transcription_request(language="en", model_id=HF_MODEL_ID, audio=[data], format=["WAV"], return_tensors="pt")
18
- inputs = {k: (v.to(model.device) if hasattr(v, 'to') else v) for k, v in inputs.items()}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
  with torch.no_grad():
20
- output_ids = model.generate(**inputs, max_new_tokens=256)
21
- # Voxtral returns full sequence; decode and strip special tokens
22
- text = processor.tokenizer.decode(output_ids[0], skip_special_tokens=True)
23
- return text
 
 
24
 
25
  with gr.Blocks() as demo:
26
  gr.Markdown(f"# 🎙️ Voxtral ASR Demo — {MODEL_NAME}")
27
- audio = gr.Audio(sources="microphone", type="numpy", label="Record or upload audio")
 
 
 
 
 
 
 
 
28
  btn = gr.Button("Transcribe")
29
- out = gr.Textbox(label="Transcription", lines=4)
30
- btn.click(transcribe, inputs=[audio], outputs=[out])
31
 
32
  if __name__ == "__main__":
33
  demo.launch(mcp_server=True, ssr_mode=False)
 
1
  import os
2
  import gradio as gr
3
  import torch
4
+ from transformers import AutoProcessor
5
+ try:
6
+ from transformers import VoxtralForConditionalGeneration as VoxtralModelClass
7
+ except Exception:
8
+ # Fallback for older transformers versions
9
+ from transformers import AutoModelForSeq2SeqLM as VoxtralModelClass
10
 
11
  HF_MODEL_ID = os.getenv("HF_MODEL_ID", "mistralai/Voxtral-Mini-3B-2507")
12
  MODEL_NAME = os.getenv("MODEL_NAME", HF_MODEL_ID.split("/")[-1])
13
  HF_USERNAME = os.getenv("HF_USERNAME", "")
14
+ MODEL_SUBFOLDER = os.getenv("MODEL_SUBFOLDER", "").strip()
15
 
16
+ try:
17
+ processor = AutoProcessor.from_pretrained(HF_MODEL_ID)
18
+ except Exception:
19
+ # Fallback: some repos may store processor files inside the subfolder
20
+ if MODEL_SUBFOLDER:
21
+ processor = AutoProcessor.from_pretrained(HF_MODEL_ID, subfolder=MODEL_SUBFOLDER)
22
+ else:
23
+ raise
24
 
25
+ device = "cuda" if torch.cuda.is_available() else "cpu"
26
+ # Use float32 on CPU; bfloat16 on CUDA if available
27
+ if torch.cuda.is_available():
28
+ model_kwargs = {"device_map": "auto", "torch_dtype": torch.bfloat16}
29
+ else:
30
+ model_kwargs = {"torch_dtype": torch.float32}
31
+
32
+ if MODEL_SUBFOLDER:
33
+ model = VoxtralModelClass.from_pretrained(
34
+ HF_MODEL_ID, subfolder=MODEL_SUBFOLDER, **model_kwargs
35
+ )
36
+ else:
37
+ model = VoxtralModelClass.from_pretrained(
38
+ HF_MODEL_ID, **model_kwargs
39
+ )
40
+
41
+ # Simple language options (with Auto detection)
42
+ LANGUAGES = {
43
+ "Auto": "auto",
44
+ "English": "en",
45
+ "French": "fr",
46
+ "German": "de",
47
+ "Spanish": "es",
48
+ "Italian": "it",
49
+ "Portuguese": "pt",
50
+ "Dutch": "nl",
51
+ "Hindi": "hi",
52
+ }
53
+
54
+ MAX_NEW_TOKENS = 1024
55
+
56
+ def transcribe(sel_language, audio_path):
57
+ if audio_path is None:
58
  return "No audio provided"
59
+ language_code = LANGUAGES.get(sel_language, "auto")
60
+ # Build Voxtral transcription inputs from filepath and selected language
61
+ if hasattr(processor, "apply_transcrition_request"):
62
+ inputs = processor.apply_transcrition_request(
63
+ language=language_code,
64
+ audio=audio_path,
65
+ model_id=HF_MODEL_ID,
66
+ )
67
+ else:
68
+ # Compatibility with potential corrected naming
69
+ inputs = processor.apply_transcription_request(
70
+ language=language_code,
71
+ audio=audio_path,
72
+ model_id=HF_MODEL_ID,
73
+ )
74
+ # Move to device with appropriate dtype
75
+ inputs = inputs.to(device, dtype=(torch.bfloat16 if device == "cuda" else torch.float32))
76
  with torch.no_grad():
77
+ output_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
78
+ # Decode only newly generated tokens (beyond the prompt length)
79
+ decoded = processor.batch_decode(
80
+ output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
81
+ )
82
+ return decoded[0]
83
 
84
  with gr.Blocks() as demo:
85
  gr.Markdown(f"# 🎙️ Voxtral ASR Demo — {MODEL_NAME}")
86
+ with gr.Row():
87
+ language = gr.Dropdown(
88
+ choices=list(LANGUAGES.keys()), value="Auto", label="Language"
89
+ )
90
+ audio = gr.Audio(
91
+ sources=["upload", "microphone"],
92
+ type="filepath",
93
+ label="Upload or record audio",
94
+ )
95
  btn = gr.Button("Transcribe")
96
+ out = gr.Textbox(label="Transcription", lines=8)
97
+ btn.click(transcribe, inputs=[language, audio], outputs=[out])
98
 
99
  if __name__ == "__main__":
100
  demo.launch(mcp_server=True, ssr_mode=False)