# Accessible Speech Recognition: Fine‑tune Voxtral on Your Own Voice

Building speech technology that understands everyone is an accessibility imperative. If you have a speech impediment (e.g., a stutter, dysarthria, or apraxia) or a heavy accent, mainstream ASR systems can struggle. This app lets you fine‑tune the Voxtral ASR model on your own voice so it adapts to your unique speaking style, improving recognition accuracy and unlocking more inclusive voice experiences.

## Who this helps

- **People with speech differences**: Personalized models that reduce error rates on your voice
- **Accented speakers**: Adapt Voxtral to your accent and vocabulary
- **Educators/clinicians**: Create tailored recognition models for communication support
- **Product teams**: Prototype inclusive voice features with real users quickly

## What you get

- **Record or upload audio** and create a JSONL dataset in a few clicks
- **One‑click training** with full fine‑tuning or LoRA for efficiency
- **Automatic publishing** to the Hugging Face Hub with a generated model card
- **Instant demo deployment** to HF Spaces for shareable, live ASR
## How it works (at a glance)

```mermaid
graph TD
    %% Main Entry Point
    START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}

    %% Documentation Categories
    OVERVIEW --> ARCH[🏗️ Architecture Overview]
    OVERVIEW --> WORKFLOW[🔄 Interface Workflow]
    OVERVIEW --> TRAINING[🚀 Training Pipeline]
    OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
    OVERVIEW --> DATAFLOW[📊 Data Flow]

    %% Architecture Section
    ARCH --> ARCH_DIAG[High-level Architecture<br/>System Components & Layers]
    ARCH --> ARCH_LINK[📄 View Details →]
    click ARCH_LINK "architecture.md"

    %% Interface Section
    WORKFLOW --> WORKFLOW_DIAG[User Journey<br/>Recording → Training → Demo]
    WORKFLOW --> WORKFLOW_LINK[📄 View Details →]
    click WORKFLOW_LINK "interface-workflow.md"

    %% Training Section
    TRAINING --> TRAINING_DIAG[Training Scripts<br/>Data → Model → Results]
    TRAINING --> TRAINING_LINK[📄 View Details →]
    click TRAINING_LINK "training-pipeline.md"

    %% Deployment Section
    DEPLOYMENT --> DEPLOYMENT_DIAG[Publishing & Demo<br/>Model → Hub → Space]
    DEPLOYMENT --> DEPLOYMENT_LINK[📄 View Details →]
    click DEPLOYMENT_LINK "deployment-pipeline.md"

    %% Data Flow Section
    DATAFLOW --> DATAFLOW_DIAG[Complete Data Journey<br/>Input → Processing → Output]
    DATAFLOW --> DATAFLOW_LINK[📄 View Details →]
    click DATAFLOW_LINK "data-flow.md"

    %% Key Components Highlight
    subgraph "🎛️ Core Components"
        INTERFACE[interface.py<br/>Gradio Web UI]
        TRAIN_SCRIPTS[scripts/train*.py<br/>Training Scripts]
        DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
        PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
    end

    %% Data Flow Highlight
    subgraph "📁 Key Data Formats"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        HFDATA[HF Hub Models<br/>username/model-name]
        SPACES[HF Spaces<br/>Interactive Demos]
    end

    %% Connect components to their respective docs
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT
    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    %% Styling
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
```
See the interactive diagram page for printing and quick navigation: [Interactive diagrams](diagrams.html).

## Quick start

### 1) Install

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

Use UV (recommended) or pip.

```bash
# UV
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt

# or pip (the stdlib venv module has no --python flag; pick the interpreter directly)
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### 2) Launch the interface

```bash
python interface.py
```

The Gradio app guides you through language selection, recording or uploading audio, dataset creation, and training.
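
If the default port is already in use, or you want to reach the app from another device on your network, Gradio's standard environment variables apply without any code changes:

```bash
# Standard Gradio environment variables; interface.py itself is unchanged
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7861 python interface.py
```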

## Create your voice dataset (UI)
```mermaid
stateDiagram-v2
    [*] --> LanguageSelection: User opens interface

    state "Language & Dataset Setup" as LangSetup {
        [*] --> LanguageSelection
        LanguageSelection --> LoadPhrases: Select language
        LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
        DisplayPhrases --> RecordingInterface: Show phrases & recording UI

        state RecordingInterface {
            [*] --> ShowInitialRows: Display first 10 phrases
            ShowInitialRows --> RecordAudio: User can record audio
            RecordAudio --> AddMoreRows: Optional - add 10 more rows
            AddMoreRows --> RecordAudio
        }
    }

    RecordingInterface --> DatasetCreation: User finishes recording

    state "Dataset Creation Options" as DatasetCreation {
        [*] --> FromRecordings: Create from recorded audio
        [*] --> FromUploads: Upload existing files
        FromRecordings --> ProcessRecordings: Save WAV files + transcripts
        FromUploads --> ProcessUploads: Process uploaded files + transcripts
        ProcessRecordings --> CreateJSONL: Generate JSONL dataset
        ProcessUploads --> CreateJSONL
        CreateJSONL --> DatasetReady: Dataset saved locally
    }

    DatasetCreation --> TrainingConfiguration: Dataset ready

    state "Training Setup" as TrainingConfiguration {
        [*] --> BasicSettings: Model, LoRA/full, batch size
        [*] --> AdvancedSettings: Learning rate, epochs, LoRA params
        BasicSettings --> ConfigureDeployment: Repo name, push options
        AdvancedSettings --> ConfigureDeployment
        ConfigureDeployment --> StartTraining: All settings configured
    }

    TrainingConfiguration --> TrainingProcess: Start training

    state "Training Process" as TrainingProcess {
        [*] --> InitializeTrackio: Setup experiment tracking
        InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
        RunTrainingScript --> StreamLogs: Show real-time training logs
        StreamLogs --> MonitorProgress: Track metrics & checkpoints
        MonitorProgress --> TrainingComplete: Training finished
        MonitorProgress --> HandleErrors: Training failed
        HandleErrors --> RetryOrExit: User can retry or exit
    }

    TrainingProcess --> PostTraining: Training complete

    state "Post-Training Actions" as PostTraining {
        [*] --> PushToHub: Push model to HF Hub
        [*] --> GenerateModelCard: Create model card
        [*] --> DeployDemoSpace: Deploy interactive demo
        PushToHub --> ModelPublished: Model available on HF Hub
        GenerateModelCard --> ModelDocumented: Model card created
        DeployDemoSpace --> DemoReady: Demo space deployed
    }

    PostTraining --> [*]: Process complete

    %% Alternative paths
    DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
    PushDatasetOnly --> DatasetPublished: Dataset on HF Hub

    %% Error handling
    TrainingProcess --> ErrorRecovery: Handle training errors
    ErrorRecovery --> RetryTraining: Retry with different settings
    RetryTraining --> TrainingConfiguration

    %% Notes
    note right of LanguageSelection : User selects language for authentic phrases from the NVIDIA Granary dataset
    note right of RecordingInterface : Users record themselves reading displayed phrases
    note right of DatasetCreation : JSONL format is {"audio_path": "...", "text": "..."}
    note right of TrainingConfiguration : Configure LoRA parameters, learning rate, epochs, etc.
    note right of TrainingProcess : Real-time log streaming with Trackio integration
    note right of PostTraining : Automated deployment pipeline
```
Steps you’ll follow in the UI:

- **Choose language**: Select a language for authentic phrases (from NVIDIA Granary)
- **Record or upload**: Capture your voice or provide existing audio plus transcripts
- **Create dataset**: The app writes a JSONL file with entries like `{"audio_path": ..., "text": ...}` (see the example after this list)
- **Configure training**: Pick the base model, LoRA vs. full fine‑tuning, batch size, and learning rate
- **Run training**: Watch live logs and metrics; retry on error if needed
- **Publish & deploy**: Push to the HF Hub and one‑click deploy an interactive Space
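
Two representative JSONL lines, one JSON object per line (the paths will point at your own recordings):

```jsonl
{"audio_path": "recordings/sample_001.wav", "text": "The quick brown fox jumps over the lazy dog."}
{"audio_path": "recordings/sample_002.wav", "text": "She sells seashells by the seashore."}
```

To inspect the dataset outside the app, here is a minimal sketch with 🤗 Datasets (the filename `dataset.jsonl` is an assumption; the 16 kHz cast mirrors the resampling the training scripts perform):

```python
# A sketch, not the app's exact loader: read the JSONL and decode audio at 16 kHz.
from datasets import Audio, load_dataset

ds = load_dataset("json", data_files="dataset.jsonl", split="train")
ds = ds.cast_column("audio_path", Audio(sampling_rate=16_000))  # paths -> decoded arrays
print(ds[0]["text"], ds[0]["audio_path"]["sampling_rate"])      # quick sanity check
```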

## Train your personalized Voxtral model

Under the hood, training uses the Hugging Face Trainer and a custom `VoxtralDataCollator` that builds Voxtral/LLaMA‑style prompts and masks the prompt tokens so the loss is computed only on the transcription.
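
The masking step is the core idea: every label position that belongs to the prompt (or to padding) is set to -100, the index PyTorch's cross‑entropy loss ignores. Here is a minimal sketch of just that step, assuming the collator has already tokenized "prompt + transcription" into `input_ids` (the real collator also packs audio features and batches examples):

```python
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int, pad_token_id: int) -> torch.Tensor:
    """Clone input_ids as labels, then hide prompt and padding from the loss."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100          # -100 is ignored by cross-entropy
    labels[labels == pad_token_id] = -100  # never learn to predict padding
    return labels
```

With labels built this way, the Trainer's standard loss only rewards correct transcription tokens.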
```mermaid
graph TB
    %% Input Data Sources
    subgraph "Data Sources"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        GRANARY[NVIDIA Granary Dataset<br/>Multilingual ASR Data]
        HFDATA[HF Hub Datasets<br/>Community Datasets]
    end

    %% Data Processing
    subgraph "Data Processing"
        LOADER["Dataset Loader<br/>_load_jsonl_dataset()"]
        CASTER[Audio Casting<br/>16kHz resampling]
        COLLATOR[VoxtralDataCollator<br/>Audio + Text Processing]
    end

    %% Training Scripts
    subgraph "Training Scripts"
        TRAIN_FULL[Full Fine-tuning<br/>scripts/train.py]
        TRAIN_LORA[LoRA Fine-tuning<br/>scripts/train_lora.py]

        subgraph "Training Components"
            MODEL_INIT[Model Initialization<br/>VoxtralForConditionalGeneration]
            LORA_CONFIG[LoRA Configuration<br/>LoraConfig + get_peft_model]
            PROCESSOR_INIT[Processor Initialization<br/>VoxtralProcessor]
        end
    end

    %% Training Infrastructure
    subgraph "Training Infrastructure"
        TRACKIO_INIT[Trackio Integration<br/>Experiment Tracking]
        HF_TRAINER[Hugging Face Trainer<br/>TrainingArguments + Trainer]
        TORCH_DEVICE[Torch Device Setup<br/>GPU/CPU Detection]
    end

    %% Training Process
    subgraph "Training Process"
        FORWARD_PASS[Forward Pass<br/>Audio Processing + Generation]
        LOSS_CALC[Loss Calculation<br/>Masked Language Modeling]
        BACKWARD_PASS[Backward Pass<br/>Gradient Computation]
        OPTIMIZER_STEP[Optimizer Step<br/>Parameter Updates]
        LOGGING[Metrics Logging<br/>Loss, Perplexity, etc.]
    end

    %% Model Management
    subgraph "Model Management"
        CHECKPOINT_SAVING[Checkpoint Saving<br/>Model snapshots]
        MODEL_SAVING[Final Model Saving<br/>Processor + Model]
        LOCAL_STORAGE[Local Storage<br/>outputs/ directory]
    end

    %% Flow Connections
    JSONL --> LOADER
    GRANARY --> LOADER
    HFDATA --> LOADER
    LOADER --> CASTER
    CASTER --> COLLATOR
    COLLATOR --> TRAIN_FULL
    COLLATOR --> TRAIN_LORA
    TRAIN_FULL --> MODEL_INIT
    TRAIN_LORA --> MODEL_INIT
    TRAIN_LORA --> LORA_CONFIG
    MODEL_INIT --> PROCESSOR_INIT
    LORA_CONFIG --> PROCESSOR_INIT
    PROCESSOR_INIT --> TRACKIO_INIT
    PROCESSOR_INIT --> HF_TRAINER
    PROCESSOR_INIT --> TORCH_DEVICE
    TRACKIO_INIT --> HF_TRAINER
    TORCH_DEVICE --> HF_TRAINER
    HF_TRAINER --> FORWARD_PASS
    FORWARD_PASS --> LOSS_CALC
    LOSS_CALC --> BACKWARD_PASS
    BACKWARD_PASS --> OPTIMIZER_STEP
    OPTIMIZER_STEP --> LOGGING
    LOGGING --> CHECKPOINT_SAVING
    LOGGING --> TRACKIO_INIT
    HF_TRAINER --> MODEL_SAVING
    MODEL_SAVING --> LOCAL_STORAGE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class JSONL,GRANARY,HFDATA input
    class LOADER,CASTER,COLLATOR processing
    class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
    class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
    class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
    class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
```
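
The `LoraConfig + get_peft_model` step in the diagram looks roughly like this; the rank, alpha, and `target_modules` below are illustrative assumptions rather than the exact values in `scripts/train_lora.py`:

```python
# Hedged sketch of the LoRA setup; all hyperparameters here are assumptions.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,              # adapter rank: lower rank means fewer trainable parameters
    lora_alpha=32,     # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # `model` is the Voxtral model loaded earlier
model.print_trainable_parameters()          # typically only ~1% of weights are trainable
```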
CLI alternatives (if you prefer the terminal):

```bash
# Full fine-tuning
uv run scripts/train.py

# Parameter‑efficient LoRA fine‑tuning (recommended for most users)
uv run scripts/train_lora.py
```

## Publish and deploy a live demo

After training, the app can push your model and metrics to the Hugging Face Hub and create an interactive Space demo automatically.
```mermaid
graph TB
    %% Input Sources
    subgraph "Inputs"
        TRAINED_MODEL[Trained Model<br/>Local directory]
        TRAINING_CONFIG[Training Config<br/>JSON/YAML]
        TRAINING_RESULTS[Training Results<br/>Metrics & logs]
        MODEL_METADATA[Model Metadata<br/>Name, description, etc.]
    end

    %% Model Publishing
    subgraph "Model Publishing"
        PUSH_SCRIPT[push_to_huggingface.py<br/>Model Publisher]

        subgraph "Publishing Steps"
            REPO_CREATION[Repository Creation<br/>HF Hub API]
            FILE_UPLOAD[File Upload<br/>Model files to HF]
            METADATA_UPLOAD[Metadata Upload<br/>Config & results]
        end
    end

    %% Model Card Generation
    subgraph "Model Card Generation"
        CARD_SCRIPT[generate_model_card.py<br/>Card Generator]

        subgraph "Card Components"
            TEMPLATE_LOAD[Template Loading<br/>model_card.md]
            VARIABLE_REPLACEMENT[Variable Replacement<br/>Config injection]
            CONDITIONAL_PROCESSING[Conditional Sections<br/>Quantized models, etc.]
        end
    end

    %% Demo Space Deployment
    subgraph "Demo Space Deployment"
        DEPLOY_SCRIPT[deploy_demo_space.py<br/>Space Deployer]

        subgraph "Space Setup"
            SPACE_CREATION[Space Repository<br/>Create HF Space]
            TEMPLATE_COPY[Template Copying<br/>demo_voxtral/ files]
            ENV_INJECTION[Environment Setup<br/>Model config injection]
            SECRET_SETUP[Secret Configuration<br/>HF_TOKEN, model vars]
        end
    end

    %% Space Building & Testing
    subgraph "Space Building"
        BUILD_TRIGGER[Build Trigger<br/>Automatic build start]
        DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
        MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
        APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
    end

    %% Live Demo
    subgraph "Live Demo Space"
        GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
        MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
        USER_INTERACTION[User Interaction<br/>Audio upload/playback]
    end

    %% External Services
    subgraph "External Services"
        HF_HUB[Hugging Face Hub<br/>Model & Space hosting]
        HF_SPACES[HF Spaces Platform<br/>Demo hosting]
    end

    %% Flow Connections
    TRAINED_MODEL --> PUSH_SCRIPT
    TRAINING_CONFIG --> PUSH_SCRIPT
    TRAINING_RESULTS --> PUSH_SCRIPT
    MODEL_METADATA --> PUSH_SCRIPT
    PUSH_SCRIPT --> REPO_CREATION
    REPO_CREATION --> FILE_UPLOAD
    FILE_UPLOAD --> METADATA_UPLOAD
    METADATA_UPLOAD --> CARD_SCRIPT
    TRAINING_CONFIG --> CARD_SCRIPT
    TRAINING_RESULTS --> CARD_SCRIPT
    CARD_SCRIPT --> TEMPLATE_LOAD
    TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
    VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING
    CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
    METADATA_UPLOAD --> DEPLOY_SCRIPT
    DEPLOY_SCRIPT --> SPACE_CREATION
    SPACE_CREATION --> TEMPLATE_COPY
    TEMPLATE_COPY --> ENV_INJECTION
    ENV_INJECTION --> SECRET_SETUP
    SECRET_SETUP --> BUILD_TRIGGER
    BUILD_TRIGGER --> DEPENDENCY_INSTALL
    DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
    MODEL_DOWNLOAD --> APP_INITIALIZATION
    APP_INITIALIZATION --> GRADIO_INTERFACE
    GRADIO_INTERFACE --> MODEL_INFERENCE
    MODEL_INFERENCE --> USER_INTERACTION
    HF_HUB --> MODEL_DOWNLOAD
    HF_SPACES --> GRADIO_INTERFACE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
    class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
    class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
    class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
    class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
    class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
    class HF_HUB,HF_SPACES external
```
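
Both steps can also be driven from the terminal via the two scripts named in the diagram. The flags below are hypothetical placeholders to show the shape of the calls; check each script's `--help` for the real arguments:

```bash
# Hypothetical flags: see `python scripts/push_to_huggingface.py --help`
python scripts/push_to_huggingface.py \
  --model-dir outputs/voxtral-finetuned \
  --repo-id your-username/voxtral-my-voice

# Hypothetical flags: see `python scripts/deploy_demo_space.py --help`
python scripts/deploy_demo_space.py \
  --model-id your-username/voxtral-my-voice
```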

## Why personalization improves accessibility

- **Your model learns your patterns**: tempo, prosody, phoneme realizations, disfluencies
- **Vocabulary and names**: teach domain terms and proper nouns you use often
- **Bias correction**: reduce systematic errors common to off‑the‑shelf ASR for your voice
- **Agency and privacy**: keep data local and publish only when you choose

## Practical tips

- **Start with LoRA**: Parameter‑efficient fine‑tuning is faster and uses less memory
- **Record diverse samples**: Vary tempo, environment, and phrase length
- **Short sessions**: Many shorter clips beat a few long ones for learning
- **Check transcripts**: Clean, accurate transcripts improve outcomes (a quick WER check follows this list)
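
A quick way to verify that a fine‑tune actually helped is to compare word error rate (WER) on a few held‑out recordings. A minimal sketch using the `jiwer` package (not one of this repo's dependencies; the transcripts below are made up for illustration):

```python
# pip install jiwer   <- not in requirements.txt
from jiwer import wer

references = ["please call stella", "ask her to bring these things"]
baseline   = ["please fall stella", "asked her to bring these thing"]  # off-the-shelf output (made up)
finetuned  = ["please call stella", "ask her to bring these things"]   # personalized output (made up)

print(f"baseline WER:   {wer(references, baseline):.2%}")
print(f"fine-tuned WER: {wer(references, finetuned):.2%}")
```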

## Learn more

- [Repository README](../README.md)
- [Documentation Overview](README.md)
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
- [Deployment Pipeline](deployment-pipeline.md)
- [Data Flow](data-flow.md)
- [Interactive Diagrams](diagrams.html)

---

This project exists to make voice technology work better for everyone. If you build a model that helps you — or your community — consider sharing a demo so others can learn from it.