Joseph Pollack committed
adds docs
- docs/README.md +246 -0
- docs/architecture.md +126 -0
- docs/data-flow.md +374 -0
- docs/deployment-pipeline.md +323 -0
- docs/diagrams.html +728 -0
- docs/interface-workflow.md +173 -0
- docs/training-pipeline.md +271 -0
- scripts/generate_svgs.py +135 -0
- scripts/validate_mermaid.py +73 -0
docs/README.md
ADDED
@@ -0,0 +1,246 @@
# Voxtral ASR Fine-tuning Documentation

```mermaid
graph TD
    %% Main Entry Point
    START([Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}

    %% Documentation Categories
    OVERVIEW --> ARCH[Architecture Overview]
    OVERVIEW --> WORKFLOW[Interface Workflow]
    OVERVIEW --> TRAINING[Training Pipeline]
    OVERVIEW --> DEPLOYMENT[Deployment Pipeline]
    OVERVIEW --> DATAFLOW[Data Flow]

    %% Architecture Section
    ARCH --> ARCH_DIAG[High-level Architecture<br/>System Components & Layers]
    ARCH --> ARCH_LINK["View Details → architecture.md"]

    %% Interface Section
    WORKFLOW --> WORKFLOW_DIAG[User Journey<br/>Recording → Training → Demo]
    WORKFLOW --> WORKFLOW_LINK["View Details → interface-workflow.md"]

    %% Training Section
    TRAINING --> TRAINING_DIAG[Training Scripts<br/>Data → Model → Results]
    TRAINING --> TRAINING_LINK["View Details → training-pipeline.md"]

    %% Deployment Section
    DEPLOYMENT --> DEPLOYMENT_DIAG[Publishing & Demo<br/>Model → Hub → Space]
    DEPLOYMENT --> DEPLOYMENT_LINK["View Details → deployment-pipeline.md"]

    %% Data Flow Section
    DATAFLOW --> DATAFLOW_DIAG[Complete Data Journey<br/>Input → Processing → Output]
    DATAFLOW --> DATAFLOW_LINK["View Details → data-flow.md"]

    %% Key Components Highlight
    subgraph "Core Components"
        INTERFACE[interface.py<br/>Gradio Web UI]
        TRAIN_SCRIPTS[scripts/train*.py<br/>Training Scripts]
        DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
        PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
    end

    %% Data Flow Highlight
    subgraph "Key Data Formats"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        HFDATA[HF Hub Models<br/>username/model-name]
        SPACES[HF Spaces<br/>Interactive Demos]
    end

    %% Connect components to their respective docs
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT

    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    %% Styling
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
```

## Voxtral ASR Fine-tuning Application

This documentation provides comprehensive diagrams and explanations of the Voxtral ASR Fine-tuning application architecture and workflows.

### What is Voxtral ASR Fine-tuning?

Voxtral is a powerful Automatic Speech Recognition (ASR) model that can be fine-tuned for specific tasks and languages. This application provides:

- **Easy Data Collection**: Record audio or upload files with transcripts
- **One-Click Training**: Fine-tune Voxtral with LoRA or full parameter updates
- **Instant Deployment**: Deploy interactive demos to Hugging Face Spaces
- **Experiment Tracking**: Monitor training progress with Trackio integration

### Documentation Overview

#### [Architecture Overview](architecture.md)
High-level view of system components and their relationships:
- **User Interface Layer**: Gradio web interface
- **Data Processing Layer**: Audio processing and dataset creation
- **Training Layer**: Full and LoRA fine-tuning scripts
- **Model Management Layer**: HF Hub integration and model cards
- **Deployment Layer**: Demo space deployment

#### [Interface Workflow](interface-workflow.md)
Complete user journey through the application:
- **Language Selection**: Choose from 25+ languages via NVIDIA Granary
- **Data Collection**: Record audio or upload existing files
- **Dataset Creation**: Process audio and transcripts into JSONL format
- **Training Configuration**: Set hyperparameters and options
- **Live Training**: Real-time progress monitoring
- **Auto Deployment**: One-click model publishing and demo creation

#### [Training Pipeline](training-pipeline.md)
Detailed training process and script interactions:
- **Data Sources**: JSONL datasets, HF Hub datasets, NVIDIA Granary
- **Data Processing**: Audio resampling, text tokenization, data collation
- **Training Scripts**: `train.py` (full) vs. `train_lora.py` (parameter-efficient)
- **Infrastructure**: Trackio logging, Hugging Face Trainer, device management
- **Model Outputs**: Trained models, training logs, checkpoints

#### [Deployment Pipeline](deployment-pipeline.md)
Model publishing and demo deployment process:
- **Model Publishing**: Push to Hugging Face Hub with metadata
- **Model Card Generation**: Automated documentation creation
- **Demo Space Deployment**: Create interactive demos on HF Spaces
- **Configuration Management**: Environment variables and secrets
- **Live Demo Features**: Real-time ASR inference interface

#### [Data Flow](data-flow.md)
Complete data journey through the system:
- **Input Sources**: Microphone recordings, file uploads, external datasets
- **Processing Pipeline**: Audio resampling, text cleaning, JSONL conversion
- **Training Flow**: Dataset loading, batching, model training
- **Output Pipeline**: Model files, logs, checkpoints, published assets
- **External Integration**: HF Hub, NVIDIA Granary, Trackio Spaces

### Core Components

| Component | Purpose | Key Features |
|-----------|---------|--------------|
| `interface.py` | Main web application | Gradio UI, data collection, training orchestration |
| `scripts/train.py` | Full model fine-tuning | Complete parameter updates, maximum accuracy |
| `scripts/train_lora.py` | LoRA fine-tuning | Parameter-efficient, faster training, lower memory |
| `scripts/deploy_demo_space.py` | Demo deployment | Automated HF Spaces creation and configuration |
| `scripts/push_to_huggingface.py` | Model publishing | HF Hub integration, model card generation |
| `scripts/generate_model_card.py` | Documentation | Automated model card creation from templates |

### Key Data Formats

#### JSONL Dataset Format
```json
{"audio_path": "path/to/audio.wav", "text": "transcription text"}
```
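
A minimal sketch of producing and consuming this format, assuming only the two fields shown above (file and path names here are illustrative):

```python
import json

# Append one record per line; each pairs an audio file with its transcript.
records = [
    {"audio_path": "wavs/recording_0000.wav", "text": "hello world"},
    {"audio_path": "wavs/recording_0001.wav", "text": "testing voxtral fine-tuning"},
]
with open("data.jsonl", "a", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read it back: one JSON object per non-empty line.
with open("data.jsonl", encoding="utf-8") as f:
    dataset = [json.loads(line) for line in f if line.strip()]
```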

#### Training Configuration
```json
{
  "model_checkpoint": "mistralai/Voxtral-Mini-3B-2507",
  "batch_size": 2,
  "learning_rate": 5e-5,
  "epochs": 3,
  "lora_r": 8,
  "lora_alpha": 32
}
```
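
As a rough illustration of consuming such a config (the field names follow the example above; this loader is a sketch, not the app's actual code):

```python
import json
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    model_checkpoint: str = "mistralai/Voxtral-Mini-3B-2507"
    batch_size: int = 2
    learning_rate: float = 5e-5
    epochs: int = 3
    lora_r: int = 8
    lora_alpha: int = 32

with open("training_config.json") as f:
    cfg = TrainingConfig(**json.load(f))
print(cfg.model_checkpoint, cfg.learning_rate)
```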

#### Model Repository Structure
```
username/model-name/
├── model.safetensors
├── config.json
├── tokenizer.json
├── README.md (model card)
└── training_results/
```
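
Such a repository can be pulled locally with the standard `huggingface_hub` client; the repo id below is a placeholder:

```python
from huggingface_hub import snapshot_download

# Downloads the full repository into the local HF cache and returns its path.
local_dir = snapshot_download(repo_id="username/model-name")
print(local_dir)  # contains model.safetensors, config.json, tokenizer.json, ...
```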

### Quick Start

1. **Set Environment Variables**:
   ```bash
   export HF_TOKEN=your_huggingface_token
   export HF_USERNAME=your_username
   ```

2. **Launch the Interface**:
   ```bash
   python interface.py
   ```

3. **Follow the Workflow**:
   - Select language → Record/upload data → Configure training → Start training
   - Monitor progress → View results → Deploy demo

### Prerequisites

- **Hardware**: NVIDIA GPU recommended for training
- **Software**: Python 3.8+, CUDA-compatible GPU drivers
- **Tokens**: Hugging Face token for model access and publishing
- **Storage**: Sufficient disk space for models and datasets

### Configuration Options

#### Training Modes
- **LoRA Fine-tuning**: Efficient, fast, lower memory usage (see the sketch below)
- **Full Fine-tuning**: Maximum accuracy, higher memory requirements
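
A minimal sketch of how the LoRA hyperparameters from the training configuration map onto a `peft` adapter config. The target modules here are a common choice for attention projections and are an assumption, not necessarily what `scripts/train_lora.py` uses:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                   # "lora_r" in the training configuration
    lora_alpha=32,                         # "lora_alpha"
    lora_dropout=0.05,                     # illustrative default
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # only adapter weights stay trainable
```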

#### Data Sources
- **User Recordings**: Live microphone input
- **File Uploads**: Existing WAV/FLAC files
- **NVIDIA Granary**: High-quality multilingual datasets
- **HF Hub Datasets**: Community-contributed datasets

#### Deployment Options
- **HF Hub Publishing**: Share models publicly
- **Demo Spaces**: Interactive web demos
- **Model Cards**: Automated documentation

### Performance & Metrics

#### Training Metrics
- **Loss Curves**: Training and validation loss
- **Perplexity**: Model confidence measure
- **Word Error Rate**: ASR accuracy (if available)
- **Training Time**: Time to convergence

#### Resource Usage
- **GPU Memory**: Peak memory usage during training
- **Training Time**: Hours to days, depending on dataset size
- **Model Size**: Disk space requirements

### Contributing

The documentation is organized as interlinked Markdown files with Mermaid diagrams. Each diagram focuses on a specific aspect:

- **architecture.md**: System overview and component relationships
- **interface-workflow.md**: User experience and interaction flow
- **training-pipeline.md**: Technical training process details
- **deployment-pipeline.md**: Publishing and deployment mechanics
- **data-flow.md**: Data movement and transformation

### Additional Resources

- **Hugging Face Spaces**: [Live Demo](https://huggingface.co/spaces)
- **Voxtral Models**: [Model Hub](https://huggingface.co/mistralai)
- **NVIDIA Granary**: [Dataset Documentation](https://huggingface.co/nvidia/Granary)
- **Trackio**: [Experiment Tracking](https://trackio.space)

---

*This documentation was automatically generated to explain the Voxtral ASR Fine-tuning application architecture and workflows.*
docs/architecture.md
ADDED
@@ -0,0 +1,126 @@
# Voxtral ASR Fine-tuning Architecture

```mermaid
graph TB
    %% User Interface Layer
    subgraph "User Interface"
        UI[Gradio Web Interface<br/>interface.py]
        REC[Audio Recording<br/>Microphone Input]
        UP[File Upload<br/>WAV/FLAC files]
    end

    %% Data Processing Layer
    subgraph "Data Processing"
        DP[Data Processing<br/>Audio resampling<br/>JSONL creation]
        DS[Dataset Management<br/>NVIDIA Granary<br/>Local datasets]
    end

    %% Training Layer
    subgraph "Training Pipeline"
        TF[Full Fine-tuning<br/>scripts/train.py]
        TL[LoRA Fine-tuning<br/>scripts/train_lora.py]
        TI[Trackio Integration<br/>Experiment Tracking]
    end

    %% Model Management Layer
    subgraph "Model Management"
        MM[Model Management<br/>Hugging Face Hub<br/>Local storage]
        MC[Model Card Generation<br/>scripts/generate_model_card.py]
    end

    %% Deployment Layer
    subgraph "Deployment & Demo"
        DEP[Demo Space Deployment<br/>scripts/deploy_demo_space.py]
        HF[HF Spaces<br/>Interactive Demo]
    end

    %% External Services
    subgraph "External Services"
        HFH[Hugging Face Hub<br/>Models & Datasets]
        GRAN[NVIDIA Granary<br/>Multilingual ASR Dataset]
        TRACK[Trackio Spaces<br/>Experiment Tracking]
    end

    %% Data Flow
    UI --> DP
    REC --> DP
    UP --> DP
    DP --> DS

    DS --> TF
    DS --> TL
    TF --> TI
    TL --> TI

    TF --> MM
    TL --> MM
    MM --> MC

    MM --> DEP
    DEP --> HF

    DS -.-> HFH
    MM -.-> HFH
    TI -.-> TRACK
    DS -.-> GRAN

    %% Styling
    classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class UI,REC,UP interface
    class DP,DS processing
    class TF,TL,TI training
    class MM,MC management
    class DEP,HF deployment
    class HFH,GRAN,TRACK external
```

## Architecture Overview

This diagram shows the high-level architecture of the Voxtral ASR Fine-tuning application. The system is organized into several layers:

### 1. User Interface Layer
- **Gradio Web Interface**: Main user-facing application built with Gradio
- **Audio Recording**: Microphone input for recording speech samples
- **File Upload**: Support for uploading existing WAV/FLAC audio files

### 2. Data Processing Layer
- **Data Processing**: Audio resampling to 16 kHz, JSONL dataset creation
- **Dataset Management**: Integration with the NVIDIA Granary dataset and local dataset handling

### 3. Training Layer
- **Full Fine-tuning**: Complete model fine-tuning using `scripts/train.py`
- **LoRA Fine-tuning**: Parameter-efficient fine-tuning using `scripts/train_lora.py`
- **Trackio Integration**: Experiment tracking and logging

### 4. Model Management Layer
- **Model Management**: Local storage and Hugging Face Hub integration
- **Model Card Generation**: Automated model card creation

### 5. Deployment Layer
- **Demo Space Deployment**: Automated deployment to Hugging Face Spaces
- **Interactive Demo**: Live demo interface for testing fine-tuned models

### 6. External Services
- **Hugging Face Hub**: Model and dataset storage and sharing
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **Trackio Spaces**: Experiment tracking and visualization

## Key Workflows

1. **Dataset Creation**: Users record audio or upload files, which are processed into JSONL format (see the sketch below)
2. **Model Training**: Datasets are fed into training scripts with experiment tracking
3. **Model Publishing**: Trained models are pushed to HF Hub with generated model cards
4. **Demo Deployment**: Automated deployment of interactive demos to HF Spaces
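
A quick way to sanity-check a dataset produced by workflow 1 before training. This uses the standard `datasets` API rather than the app's internal loader, and assumes the JSONL layout described in the docs:

```python
from datasets import load_dataset, Audio

# Load the JSONL file; each row has "audio_path" and "text".
ds = load_dataset("json", data_files="datasets/voxtral_user/data.jsonl", split="train")

# Decode the audio paths into 16 kHz waveforms, as the training pipeline expects.
ds = ds.rename_column("audio_path", "audio").cast_column("audio", Audio(sampling_rate=16000))
print(ds[0]["audio"]["sampling_rate"], ds[0]["text"])
```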

See also:
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
- [Deployment Pipeline](deployment-pipeline.md)
- [Data Flow](data-flow.md)
docs/data-flow.md
ADDED
@@ -0,0 +1,374 @@
# Data Flow

```mermaid
flowchart TD
    %% User Input Sources
    subgraph "User Input"
        MIC[Microphone Recording<br/>Raw audio + timestamps]
        FILE[File Upload<br/>WAV/FLAC files]
        TEXT[Manual Transcripts<br/>Text input]
        LANG[Language Selection<br/>25+ languages]
    end

    %% Data Processing Pipeline
    subgraph "Data Processing"
        AUDIO_PROC[Audio Processing<br/>Resampling to 16kHz<br/>Format conversion]
        TEXT_PROC[Text Processing<br/>Transcript validation<br/>Cleaning & formatting]
        JSONL_CONV["JSONL Conversion<br/>{'audio_path': '...', 'text': '...'}"]
    end

    %% Dataset Storage
    subgraph "Dataset Storage"
        LOCAL_DS[Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/]
        HF_DS[HF Hub Dataset<br/>username/dataset-name<br/>Public sharing]
    end

    %% Training Data Flow
    subgraph "Training Data Pipeline"
        DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
        AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
        TRAIN_SPLIT[Train Split<br/>train_dataset]
        EVAL_SPLIT[Eval Split<br/>eval_dataset]
    end

    %% Model Training
    subgraph "Model Training"
        COLLATOR[VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction]
        FORWARD[Forward Pass<br/>Audio → Features → Text]
        LOSS[Loss Calculation<br/>Masked LM loss]
        BACKWARD[Backward Pass<br/>Gradient computation]
        OPTIMIZE[Parameter Updates<br/>LoRA or full fine-tuning]
    end

    %% Training Outputs
    subgraph "Training Outputs"
        MODEL_FILES[Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json]
        TRAINING_LOGS[Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves]
        CHECKPOINTS[Checkpoints<br/>Intermediate models<br/>best model tracking]
    end

    %% Publishing Pipeline
    subgraph "Publishing Pipeline"
        HF_REPO[HF Repository<br/>username/model-name<br/>Model hosting]
        MODEL_CARD[Model Card<br/>README.md<br/>Training details<br/>Usage examples]
        METADATA[Training Metadata<br/>Config + results<br/>Performance metrics]
    end

    %% Demo Deployment
    subgraph "Demo Deployment"
        SPACE_REPO[HF Space Repository<br/>username/model-name-demo<br/>Demo hosting]
        DEMO_APP[Demo Application<br/>Gradio interface<br/>Real-time inference]
        ENV_VARS[Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets]
    end

    %% External Data Sources
    subgraph "External Data Sources"
        GRANARY[NVIDIA Granary<br/>Multilingual ASR data<br/>25+ languages]
        HF_COMM[HF Community Datasets<br/>Public ASR datasets<br/>Standard formats]
    end

    %% Data Flow Connections
    MIC --> AUDIO_PROC
    FILE --> AUDIO_PROC
    TEXT --> TEXT_PROC
    LANG --> TEXT_PROC

    AUDIO_PROC --> JSONL_CONV
    TEXT_PROC --> JSONL_CONV

    JSONL_CONV --> LOCAL_DS
    LOCAL_DS --> HF_DS

    LOCAL_DS --> DS_LOADER
    HF_DS --> DS_LOADER
    GRANARY --> DS_LOADER
    HF_COMM --> DS_LOADER

    DS_LOADER --> AUDIO_CAST
    AUDIO_CAST --> TRAIN_SPLIT
    AUDIO_CAST --> EVAL_SPLIT

    TRAIN_SPLIT --> COLLATOR
    EVAL_SPLIT --> COLLATOR

    COLLATOR --> FORWARD
    FORWARD --> LOSS
    LOSS --> BACKWARD
    BACKWARD --> OPTIMIZE

    OPTIMIZE --> MODEL_FILES
    OPTIMIZE --> TRAINING_LOGS
    OPTIMIZE --> CHECKPOINTS

    MODEL_FILES --> HF_REPO
    TRAINING_LOGS --> HF_REPO
    CHECKPOINTS --> HF_REPO

    HF_REPO --> MODEL_CARD
    TRAINING_LOGS --> MODEL_CARD

    MODEL_CARD --> SPACE_REPO
    HF_REPO --> SPACE_REPO
    ENV_VARS --> SPACE_REPO

    SPACE_REPO --> DEMO_APP

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
    classDef external fill:#efebe9,stroke:#5d4037,stroke-width:2px

    class MIC,FILE,TEXT,LANG input
    class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
    class LOCAL_DS,HF_DS storage
    class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
    class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
    class HF_REPO,MODEL_CARD,METADATA publishing
    class SPACE_REPO,DEMO_APP,ENV_VARS deployment
    class GRANARY,HF_COMM external
```

## Data Flow Overview

This diagram illustrates the complete data flow through the Voxtral ASR Fine-tuning application, from user input to deployed demo.

### Data Input Sources

#### User-Generated Data
- **Microphone Recording**: Raw audio captured through the browser microphone
- **File Upload**: Existing WAV/FLAC audio files
- **Manual Transcripts**: User-provided text transcriptions
- **Language Selection**: Influences phrase selection from NVIDIA Granary

#### External Data Sources
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **HF Community Datasets**: Public datasets from the Hugging Face Hub

### Data Processing Pipeline

#### Audio Processing
```python
import librosa
import soundfile as sf

# Resample to 16 kHz; librosa.load returns the waveform and its sampling rate.
audio, sr = librosa.load(audio_path, sr=16000)
# Write back as WAV for format consistency.
sf.write(output_path, audio, 16000)
```

#### Text Processing
```python
# Text cleaning and validation
text = text.strip()
# Basic validation (length, content checks)
assert len(text) > 0, "Empty transcription"
```

#### JSONL Conversion
```python
import json

# Standard format for all datasets
entry = {
    "audio_path": str(audio_file_path),
    "text": cleaned_transcription,
}
# Append to the JSONL file, one record per line
with open(jsonl_path, "a") as f:
    f.write(json.dumps(entry) + "\n")
```

### Dataset Storage

#### Local Storage Structure
```
datasets/voxtral_user/
├── data.jsonl            # Main dataset file
├── recorded_data.jsonl   # From recordings
└── wavs/                 # Audio files
    ├── recording_0000.wav
    ├── recording_0001.wav
    └── ...
```

#### HF Hub Storage
- **Public Datasets**: Shareable with the community
- **Version Control**: Dataset versioning and updates
- **Standard Metadata**: Automatic README generation

### Training Data Pipeline

#### Dataset Loading
```python
from datasets import load_dataset

# Load a local JSONL dataset
ds = _load_jsonl_dataset("datasets/voxtral_user/data.jsonl")

# Or load a dataset from the HF Hub
ds = load_dataset("username/dataset-name", split="train")
```

#### Audio Casting
```python
from datasets import Audio

# Ensure a consistent sampling rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```

#### Train/Eval Split
```python
# Create train and eval datasets
train_dataset = ds.select(range(train_count))
eval_dataset = ds.select(range(train_count, train_count + eval_count))
```

### Training Process Flow

#### Data Collation
- **VoxtralDataCollator**: Custom collator for the Voxtral model
- **Audio Processing**: Convert audio to model inputs
- **Prompt Construction**: Build `[AUDIO]...[AUDIO] <transcribe>` prompts
- **Text Tokenization**: Process transcription targets
- **Masking**: Mask audio prompt tokens during training (see the sketch below)
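
The real `VoxtralDataCollator` lives in the training scripts; the sketch below only illustrates the masking idea described above, with hypothetical `encode_prompt`/`encode_text` helpers standing in for the actual processor calls:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch, processor, pad_id=0, ignore_index=-100):
    """Tokenize prompt + target per example and mask the prompt in the labels."""
    input_ids, labels = [], []
    for example in batch:
        prompt_ids = processor.encode_prompt(example["audio"])  # hypothetical helper
        target_ids = processor.encode_text(example["text"])     # hypothetical helper
        input_ids.append(torch.tensor(prompt_ids + target_ids))
        # Loss is computed only on the transcript: prompt positions get ignore_index.
        labels.append(torch.tensor([ignore_index] * len(prompt_ids) + target_ids))
    return {
        "input_ids": pad_sequence(input_ids, batch_first=True, padding_value=pad_id),
        "labels": pad_sequence(labels, batch_first=True, padding_value=ignore_index),
    }
```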

#### Forward Pass
1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Extract audio features
3. **Language Model**: Generate the transcription autoregressively
4. **Loss Calculation**: Compare generated vs. target text

#### Backward Pass & Optimization
- **Gradient Computation**: Backpropagation
- **LoRA Updates**: Update only adapter parameters (LoRA mode)
- **Full Updates**: Update all parameters (full fine-tuning)
- **Optimizer Step**: Apply gradients with learning rate scheduling

### Training Outputs

#### Model Files
- **model.safetensors**: Model weights (safetensors format)
- **config.json**: Model configuration
- **tokenizer.json**: Tokenizer configuration
- **generation_config.json**: Generation parameters

#### Training Logs
- **train_results.json**: Final training metrics
- **eval_results.json**: Evaluation results
- **training_config.json**: Training hyperparameters
- **trainer_state.json**: Training state and checkpoints

#### Checkpoints
- **checkpoint-XXX/**: Intermediate model snapshots
- **best-model/**: Best performing model
- **final-model/**: Final trained model

### Publishing Pipeline

#### HF Repository Structure
```
username/model-name/
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── config.json
├── tokenizer.json
├── training_config.json
├── train_results.json
├── README.md (model card)
└── training_results/
    └── training.log
```

#### Model Card Generation
- **Template Processing**: Fill the model_card.md template
- **Variable Injection**: Training config, results, metadata
- **Conditional Sections**: Handle quantized models, etc.

### Demo Deployment

#### Space Repository Structure
```
username/model-name-demo/
├── app.py             # Gradio demo application
├── requirements.txt   # Python dependencies
├── README.md          # Space documentation
└── .env               # Environment variables
```

#### Environment Configuration
```bash
# Space environment variables
HF_MODEL_ID=username/model-name
MODEL_NAME="Voxtral Fine-tuned Model"
HF_TOKEN=read_only_token  # For model access
BRAND_OWNER_NAME=username
# ... other branding variables
```

### Data Flow Patterns

#### Streaming vs. Batch Processing
- **Training Data**: Batch processing for efficiency
- **External Datasets**: Streaming loading for memory efficiency
- **User Input**: Real-time processing with immediate feedback

#### Data Validation
- **Input Validation**: Check audio format, sampling rate, text length
- **Quality Assurance**: Filter out empty or invalid entries
- **Consistency Checks**: Ensure audio-text alignment

#### Error Handling
- **Graceful Degradation**: Fall back to local data if external sources fail
- **Retry Logic**: Automatic retry on network failures
- **Logging**: Comprehensive error logging and debugging

### Performance Considerations

#### Memory Management
- **Streaming Loading**: Process large datasets without loading everything at once
- **Audio Caching**: Cache processed audio features
- **Batch Optimization**: Balance batch size with available memory

#### Storage Optimization
- **Compression**: Use efficient audio formats
- **Deduplication**: Avoid duplicate data entries
- **Cleanup**: Remove temporary files after processing

#### Network Efficiency
- **Incremental Uploads**: Upload files as they're ready
- **Resume Capability**: Resume interrupted uploads
- **Caching**: Cache frequently accessed data

### Security & Privacy

#### Data Privacy
- **Local Processing**: Audio files processed locally when possible
- **User Consent**: Clear data usage policies
- **Anonymization**: Remove personally identifiable information

#### Access Control
- **Token Management**: Secure HF token storage
- **Repository Permissions**: Appropriate public/private settings
- **Rate Limiting**: Prevent abuse of demo interfaces

### Monitoring & Analytics

#### Data Quality Metrics
- **Audio Quality**: Sampling rate, format validation
- **Text Quality**: Length, language detection, consistency
- **Dataset Statistics**: Size, distribution, coverage

#### Performance Metrics
- **Processing Time**: Data loading, preprocessing, training time
- **Model Metrics**: Loss, perplexity, WER (if available)
- **Resource Usage**: Memory, CPU/GPU utilization

#### User Analytics
- **Usage Patterns**: Popular languages, dataset sizes
- **Success Rates**: Training completion, deployment success
- **Error Patterns**: Common failure modes and solutions

See also:
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
docs/deployment-pipeline.md
ADDED
@@ -0,0 +1,323 @@
# Deployment Pipeline

```mermaid
graph TB
    %% Input Sources
    subgraph "Inputs"
        TRAINED_MODEL[Trained Model<br/>Local directory]
        TRAINING_CONFIG[Training Config<br/>JSON/YAML]
        TRAINING_RESULTS[Training Results<br/>Metrics & logs]
        MODEL_METADATA[Model Metadata<br/>Name, description, etc.]
    end

    %% Model Publishing
    subgraph "Model Publishing"
        PUSH_SCRIPT[push_to_huggingface.py<br/>Model Publisher]

        subgraph "Publishing Steps"
            REPO_CREATION[Repository Creation<br/>HF Hub API]
            FILE_UPLOAD[File Upload<br/>Model files to HF]
            METADATA_UPLOAD[Metadata Upload<br/>Config & results]
        end
    end

    %% Model Card Generation
    subgraph "Model Card Generation"
        CARD_SCRIPT[generate_model_card.py<br/>Card Generator]

        subgraph "Card Components"
            TEMPLATE_LOAD[Template Loading<br/>model_card.md]
            VARIABLE_REPLACEMENT[Variable Replacement<br/>Config injection]
            CONDITIONAL_PROCESSING[Conditional Sections<br/>Quantized models, etc.]
        end
    end

    %% Demo Space Deployment
    subgraph "Demo Space Deployment"
        DEPLOY_SCRIPT[deploy_demo_space.py<br/>Space Deployer]

        subgraph "Space Setup"
            SPACE_CREATION[Space Repository<br/>Create HF Space]
            TEMPLATE_COPY[Template Copying<br/>demo_voxtral/ files]
            ENV_INJECTION[Environment Setup<br/>Model config injection]
            SECRET_SETUP[Secret Configuration<br/>HF_TOKEN, model vars]
        end
    end

    %% Space Building & Testing
    subgraph "Space Building"
        BUILD_TRIGGER[Build Trigger<br/>Automatic build start]
        DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
        MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
        APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
    end

    %% Live Demo
    subgraph "Live Demo Space"
        GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
        MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
        USER_INTERACTION[User Interaction<br/>Audio upload/playback]
    end

    %% External Services
    subgraph "External Services"
        HF_HUB[Hugging Face Hub<br/>Model & Space hosting]
        HF_SPACES[HF Spaces Platform<br/>Demo hosting]
    end

    %% Flow Connections
    TRAINED_MODEL --> PUSH_SCRIPT
    TRAINING_CONFIG --> PUSH_SCRIPT
    TRAINING_RESULTS --> PUSH_SCRIPT
    MODEL_METADATA --> PUSH_SCRIPT

    PUSH_SCRIPT --> REPO_CREATION
    REPO_CREATION --> FILE_UPLOAD
    FILE_UPLOAD --> METADATA_UPLOAD

    METADATA_UPLOAD --> CARD_SCRIPT
    TRAINING_CONFIG --> CARD_SCRIPT
    TRAINING_RESULTS --> CARD_SCRIPT

    CARD_SCRIPT --> TEMPLATE_LOAD
    TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
    VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING

    CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
    METADATA_UPLOAD --> DEPLOY_SCRIPT

    DEPLOY_SCRIPT --> SPACE_CREATION
    SPACE_CREATION --> TEMPLATE_COPY
    TEMPLATE_COPY --> ENV_INJECTION
    ENV_INJECTION --> SECRET_SETUP

    SECRET_SETUP --> BUILD_TRIGGER
    BUILD_TRIGGER --> DEPENDENCY_INSTALL
    DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
    MODEL_DOWNLOAD --> APP_INITIALIZATION

    APP_INITIALIZATION --> GRADIO_INTERFACE
    GRADIO_INTERFACE --> MODEL_INFERENCE
    MODEL_INFERENCE --> USER_INTERACTION

    HF_HUB --> MODEL_DOWNLOAD
    HF_SPACES --> GRADIO_INTERFACE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
    class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
    class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
    class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
    class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
    class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
    class HF_HUB,HF_SPACES external
```

## Deployment Pipeline Overview

This diagram illustrates the complete deployment pipeline that takes a trained Voxtral model and makes it available as an interactive demo on Hugging Face Spaces.

### Input Sources

#### Trained Model Artifacts
- **Model Files**: `model.safetensors`, `config.json`, `tokenizer.json`
- **Training Config**: Hyperparameters and training setup
- **Training Results**: Metrics, loss curves, evaluation results
- **Model Metadata**: Name, description, base model information

### Model Publishing Phase

#### push_to_huggingface.py Script
```python
# Initialize the publisher
pusher = HuggingFacePusher(
    model_path=output_dir,
    repo_name=repo_name,
    token=hf_token,
)

# Push the model
success = pusher.push_model(training_config, results)
```

#### Publishing Steps
1. **Repository Creation**: Create the HF Hub repository
2. **File Upload**: Upload all model files
3. **Metadata Upload**: Upload training config and results

### Model Card Generation

#### generate_model_card.py Script
```python
# Create the generator
generator = ModelCardGenerator()

# Generate the card
variables = {
    "model_name": model_name,
    "repo_name": repo_id,
    "base_model": base_model,
    # ... other variables
}
content = generator.generate_model_card(variables)
```

#### Card Processing
1. **Template Loading**: Load from `templates/model_card.md`
2. **Variable Replacement**: Inject actual values (see the sketch below)
3. **Conditional Processing**: Handle optional sections
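
As a rough illustration of step 2, assuming a `$`-placeholder template (the real template and its syntax live in `templates/model_card.md`):

```python
from pathlib import Path
from string import Template

# Hypothetical template content: "# $model_name\n\nBase model: $base_model\n..."
template = Template(Path("templates/model_card.md").read_text())

card = template.safe_substitute(
    model_name="Voxtral Fine-tuned Model",
    base_model="mistralai/Voxtral-Mini-3B-2507",
)
Path("README.md").write_text(card)  # becomes the repository's model card
```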

### Demo Space Deployment

#### deploy_demo_space.py Script
```python
# Initialize the deployer
deployer = DemoSpaceDeployer(
    hf_token=token,
    hf_username=username,
    model_id=model_id,
    demo_type="voxtral",
)

# Deploy the space
success = deployer.deploy()
```

#### Space Setup Process
1. **Space Creation**: Create the HF Space repository (see the sketch below)
2. **Template Copying**: Copy the demo template files
3. **Environment Injection**: Set model-specific variables
4. **Secret Configuration**: Configure HF_TOKEN and model variables
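
Steps 1-4 map naturally onto standard `huggingface_hub` calls; this sketch assumes the deployer uses them roughly as follows (ids and tokens are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_write_token")
repo_id = "username/model-name-demo"

# 1. Create a Gradio Space repository.
api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)

# 2-3. Upload the demo template files (app.py, requirements.txt, README.md).
api.upload_folder(folder_path="templates/spaces/demo_voxtral", repo_id=repo_id, repo_type="space")

# 4. Configure the variables and secrets the demo reads at runtime.
api.add_space_variable(repo_id, "HF_MODEL_ID", "username/model-name")
api.add_space_secret(repo_id, "HF_TOKEN", "hf_read_only_token")
```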

### Space Building Process

#### Automatic Build Trigger
- **Dependency Installation**: `pip install -r requirements.txt`
- **Model Download**: Download the model from the HF Hub
- **App Initialization**: Set up the Gradio application

#### Demo Template Structure
```
templates/spaces/demo_voxtral/
├── app.py             # Main Gradio application
├── requirements.txt   # Python dependencies
└── README.md          # Space documentation
```

### Live Demo Features

#### Gradio Interface
- **Audio Upload**: File upload or recording
- **Real-time Inference**: Live ASR transcription
- **Interactive Controls**: Model parameters, settings

#### Model Inference Pipeline
- **Audio Processing**: Convert audio to model inputs
- **Transcription Generation**: Run ASR inference
- **Result Display**: Show the transcription with confidence

### Configuration Management

#### Environment Variables
```python
import os

# Set in the Space secrets/environment
os.environ['HF_MODEL_ID'] = model_id
os.environ['MODEL_NAME'] = model_name
os.environ['HF_TOKEN'] = token  # For model access
```

#### Demo-Specific Settings
- **Model Configuration**: Base model, subfolder, quantization
- **UI Branding**: Custom titles, descriptions, links
- **Example Prompts**: Pre-configured demo examples

### Error Handling & Monitoring

#### Build Process Monitoring
- **Build Logs**: Real-time build status
- **Error Detection**: Failed dependency installation
- **Retry Logic**: Automatic rebuild on failure

#### Runtime Monitoring
- **Space Health**: Uptime and responsiveness
- **Model Loading**: Successful model initialization
- **Inference Errors**: Runtime error handling

### Security Considerations

#### Token Management
- **Read-Only Tokens**: Use read-only tokens for demo spaces
- **Secret Storage**: Secure storage of HF_TOKEN
- **Access Control**: Proper repository permissions

#### Resource Management
- **Memory Limits**: Space hardware constraints
- **Timeout Handling**: Inference timeout protection
- **Rate Limiting**: Prevent abuse

### Integration Points

#### With Training Scripts
- **Training Config**: Used for model card generation
- **Training Results**: Included in model metadata
- **Model Path**: Direct path to trained model files

#### With the Interface (interface.py)
- **Parameter Passing**: Deployment settings from the UI
- **Progress Updates**: Deployment progress shown to the user
- **Result Links**: Direct links to deployed spaces

### Deployment Workflows

#### Full Pipeline (Recommended)
1. Train model → generate model card → push to Hub → deploy demo
2. All steps automated through a single interface action
3. Comprehensive error handling and rollback

#### Manual Deployment
1. Use individual scripts for granular control
2. Custom configuration and branding
3. Debugging and troubleshooting capabilities

#### CI/CD Integration
- **Automated Triggers**: GitHub Actions integration
- **Version Control**: Model versioning and releases
- **Testing**: Automated demo testing

### Performance Optimization

#### Space Hardware Selection
- **CPU Basic**: Free tier, sufficient for small models
- **GPU Options**: For larger models requiring acceleration
- **Memory Scaling**: Based on model size requirements

#### Model Optimization
- **Quantization**: 4-bit quantization for a smaller footprint
- **Model Sharding**: Split large models across memory
- **Caching**: Model caching for faster cold starts

### Monitoring & Analytics

#### Space Analytics
- **Usage Metrics**: Daily active users, session duration
- **Performance Metrics**: Inference latency, error rates
- **User Feedback**: Demo effectiveness and issues

#### Model Analytics
- **Download Stats**: Model popularity and usage
- **Citation Tracking**: Academic and research usage
- **Community Feedback**: GitHub issues and discussions

See also:
- [Architecture Overview](architecture.md)
- [Training Pipeline](training-pipeline.md)
- [Data Flow](data-flow.md)
docs/diagrams.html
ADDED
@@ -0,0 +1,728 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Voxtral ASR Fine-tuning - Architecture Diagrams</title>
    <script type="module">
        import mermaid from 'https://cdn.jsdelivr.net/npm/[email protected]/dist/mermaid.esm.min.mjs';
        mermaid.initialize({
            startOnLoad: true,
            theme: 'base',
            themeVariables: {
                primaryColor: '#e3f2fd',
                primaryTextColor: '#1976d2',
                primaryBorderColor: '#01579b',
                lineColor: '#424242',
                secondaryColor: '#fff3e0',
                tertiaryColor: '#fce4ec',
                background: '#ffffff',
                mainBkg: '#ffffff',
                secondBkg: '#f5f5f5',
                textColor: '#333333'
            },
            flowchart: {
                useMaxWidth: true,
                htmlLabels: true,
                curve: 'basis'
            },
            sequence: {
                useMaxWidth: true
            }
        });
    </script>
    <style>
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            line-height: 1.6;
            color: #333;
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
            background: #f8f9fa;
        }

        .header {
            text-align: center;
            margin-bottom: 40px;
            padding: 20px;
            background: white;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }

        .diagram-container {
            background: white;
            margin: 20px 0;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }

        .diagram-title {
            font-size: 1.5em;
            font-weight: bold;
            margin-bottom: 15px;
            color: #1976d2;
            border-bottom: 2px solid #e3f2fd;
            padding-bottom: 10px;
        }

        .diagram-description {
            margin-bottom: 20px;
            color: #666;
            font-style: italic;
        }

        .navigation {
            position: fixed;
            top: 20px;
            right: 20px;
            background: white;
            padding: 15px;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
            max-width: 200px;
        }

        .nav-link {
            display: block;
            padding: 8px 0;
            color: #1976d2;
            text-decoration: none;
            border-bottom: 1px solid #eee;
        }

        .nav-link:hover {
            color: #01579b;
            text-decoration: underline;
        }

        .nav-link:last-child {
            border-bottom: none;
        }

        .code-toggle {
            background: #f5f5f5;
            border: 1px solid #ddd;
            padding: 10px;
            margin: 10px 0;
            border-radius: 4px;
            cursor: pointer;
            font-size: 0.9em;
        }

        .mermaid-code {
            display: none;
            background: #f8f9fa;
            border: 1px solid #dee2e6;
            border-radius: 4px;
            padding: 15px;
            margin: 10px 0;
            font-family: 'Courier New', monospace;
            font-size: 0.85em;
            white-space: pre-wrap;
            overflow-x: auto;
        }

        .download-btn {
            background: #1976d2;
            color: white;
            border: none;
            padding: 8px 16px;
            border-radius: 4px;
            cursor: pointer;
            font-size: 0.9em;
            margin: 10px 5px 10px 0;
        }

        .download-btn:hover {
            background: #01579b;
        }

        @media print {
            .navigation, .code-toggle, .download-btn {
                display: none;
            }
            .diagram-container {
                break-inside: avoid;
                margin: 10px 0;
            }
        }
    </style>
</head>
<body>
    <div class="header">
        <h1>Voxtral ASR Fine-tuning</h1>
        <h2>Architecture & Workflow Diagrams</h2>
        <p>Interactive documentation with Mermaid diagrams</p>
    </div>

    <nav class="navigation">
        <strong>Quick Navigation</strong>
        <a href="#overview" class="nav-link">Overview</a>
        <a href="#architecture" class="nav-link">Architecture</a>
        <a href="#interface" class="nav-link">Interface Workflow</a>
        <a href="#training" class="nav-link">Training Pipeline</a>
        <a href="#deployment" class="nav-link">Deployment Pipeline</a>
        <a href="#dataflow" class="nav-link">Data Flow</a>
    </nav>

    <div id="overview" class="diagram-container">
        <div class="diagram-title">Documentation Overview</div>
        <div class="diagram-description">
            High-level overview of the Voxtral ASR Fine-tuning application and its documentation structure.
        </div>
        <div class="mermaid">
graph TD
    START(["Voxtral ASR Fine-tuning App"]) --> OVERVIEW{Choose Documentation}

    OVERVIEW --> ARCH["Architecture Overview"]
    OVERVIEW --> WORKFLOW["Interface Workflow"]
    OVERVIEW --> TRAINING["Training Pipeline"]
    OVERVIEW --> DEPLOYMENT["Deployment Pipeline"]
    OVERVIEW --> DATAFLOW["Data Flow"]

    ARCH --> ARCH_DIAG["High-level Architecture<br/>System Components & Layers"]
    WORKFLOW --> WORKFLOW_DIAG["User Journey<br/>Recording → Training → Demo"]
    TRAINING --> TRAINING_DIAG["Training Scripts<br/>Data → Model → Results"]
    DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo<br/>Model → Hub → Space"]
    DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey<br/>Input → Processing → Output"]

    subgraph "Core Components"
        INTERFACE["interface.py<br/>Gradio Web UI"]
        TRAIN_SCRIPTS["scripts/train*.py<br/>Training Scripts"]
        DEPLOY_SCRIPT["scripts/deploy_demo_space.py<br/>Demo Deployment"]
        PUSH_SCRIPT["scripts/push_to_huggingface.py<br/>Model Publishing"]
    end

    subgraph "Key Data Formats"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        HFDATA["HF Hub Models<br/>username/model-name"]
        SPACES["HF Spaces<br/>Interactive Demos"]
    end

    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT

    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
        </div>
    </div>

    <div id="architecture" class="diagram-container">
        <div class="diagram-title">System Architecture</div>
        <div class="diagram-description">
            High-level architecture showing the main components and their relationships in the Voxtral ASR Fine-tuning application.
        </div>
        <div class="mermaid">
graph TB
    subgraph "User Interface"
        UI["Gradio Web Interface<br/>interface.py"]
        REC["Audio Recording<br/>Microphone Input"]
        UP["File Upload<br/>WAV/FLAC files"]
    end

    subgraph "Data Processing"
        DP["Data Processing<br/>Audio resampling<br/>JSONL creation"]
        DS["Dataset Management<br/>NVIDIA Granary<br/>Local datasets"]
    end

    subgraph "Training Pipeline"
        TF["Full Fine-tuning<br/>scripts/train.py"]
        TL["LoRA Fine-tuning<br/>scripts/train_lora.py"]
        TI["Trackio Integration<br/>Experiment Tracking"]
    end

    subgraph "Model Management"
        MM["Model Management<br/>Hugging Face Hub<br/>Local storage"]
        MC["Model Card Generation<br/>scripts/generate_model_card.py"]
|
255 |
+
end
|
256 |
+
|
257 |
+
subgraph "Deployment & Demo"
|
258 |
+
DEP["Demo Space Deployment<br/>scripts/deploy_demo_space.py"]
|
259 |
+
HF["HF Spaces<br/>Interactive Demo"]
|
260 |
+
end
|
261 |
+
|
262 |
+
subgraph "External Services"
|
263 |
+
HFH["Hugging Face Hub<br/>Models & Datasets"]
|
264 |
+
GRAN["NVIDIA Granary<br/>Multilingual ASR Dataset"]
|
265 |
+
TRACK["Trackio Spaces<br/>Experiment Tracking"]
|
266 |
+
end
|
267 |
+
|
268 |
+
UI --> DP
|
269 |
+
REC --> DP
|
270 |
+
UP --> DP
|
271 |
+
DP --> DS
|
272 |
+
|
273 |
+
DS --> TF
|
274 |
+
DS --> TL
|
275 |
+
TF --> TI
|
276 |
+
TL --> TI
|
277 |
+
|
278 |
+
TF --> MM
|
279 |
+
TL --> MM
|
280 |
+
MM --> MC
|
281 |
+
|
282 |
+
MM --> DEP
|
283 |
+
DEP --> HF
|
284 |
+
|
285 |
+
DS -.-> HFH
|
286 |
+
MM -.-> HFH
|
287 |
+
TI -.-> TRACK
|
288 |
+
DS -.-> GRAN
|
289 |
+
|
290 |
+
classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
|
291 |
+
classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
|
292 |
+
classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
|
293 |
+
classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
|
294 |
+
classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
|
295 |
+
classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
|
296 |
+
|
297 |
+
class UI,REC,UP interface
|
298 |
+
class DP,DS processing
|
299 |
+
class TF,TL,TI training
|
300 |
+
class MM,MC management
|
301 |
+
class DEP,HF deployment
|
302 |
+
class HFH,GRAN,TRACK external
|
303 |
+
</div>
|
304 |
+
</div>
|
305 |
+
|
306 |
+
<div id="interface" class="diagram-container">
|
307 |
+
<div class="diagram-title">Interface Workflow</div>
|
308 |
+
<div class="diagram-description">
|
309 |
+
Complete user journey through the Voxtral ASR Fine-tuning interface, from language selection to demo deployment.
|
310 |
+
</div>
|
311 |
+
<div class="mermaid">
|
312 |
+
flowchart TD
|
313 |
+
START(["User Opens Interface"]) --> LANG["Language Selection<br/>Choose from 25+ languages"]
|
314 |
+
LANG --> PHRASES["Load Phrases<br/>From NVIDIA Granary"]
|
315 |
+
PHRASES --> RECORD["Recording Interface<br/>Display phrases + audio recording"]
|
316 |
+
|
317 |
+
RECORD --> |User Records| PROCESS_REC["Process Recordings<br/>Save WAV files + transcripts"]
|
318 |
+
RECORD --> |Upload Files| PROCESS_UPLOAD["Process Uploads<br/>Handle existing files + transcripts"]
|
319 |
+
|
320 |
+
PROCESS_REC --> JSONL["Create JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
|
321 |
+
PROCESS_UPLOAD --> JSONL
|
322 |
+
|
323 |
+
JSONL --> CONFIG["Training Configuration<br/>Model, LoRA/full, hyperparameters"]
|
324 |
+
CONFIG --> TRAIN["Training Process<br/>Execute train.py or train_lora.py"]
|
325 |
+
|
326 |
+
TRAIN --> PUSH["Push to Hub<br/>Model + metadata to HF Hub"]
|
327 |
+
TRAIN --> CARD["Generate Model Card<br/>Automated documentation"]
|
328 |
+
PUSH --> DEPLOY["Deploy Demo Space<br/>Interactive demo on HF Spaces"]
|
329 |
+
|
330 |
+
DEPLOY --> END(["Demo Ready<br/>Interactive ASR Demo"])
|
331 |
+
|
332 |
+
PUSH -.-> END
|
333 |
+
CARD -.-> END
|
334 |
+
|
335 |
+
classDef start fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
|
336 |
+
classDef process fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
|
337 |
+
classDef decision fill:#fff3e0,stroke:#f57c00,stroke-width:2px
|
338 |
+
classDef terminal fill:#e8f5e8,stroke:#388e3c,stroke-width:3px
|
339 |
+
|
340 |
+
class START start
|
341 |
+
class END terminal
|
342 |
+
class LANG,PHRASES,RECORD,PROCESS_REC,PROCESS_UPLOAD,JSONL,CONFIG,TRAIN,PUSH,CARD,DEPLOY process
|
343 |
+
</div>
|
344 |
+
</div>
|
345 |
+
|
346 |
+
<div id="training" class="diagram-container">
|
347 |
+
<div class="diagram-title">Training Pipeline</div>
|
348 |
+
<div class="diagram-description">
|
349 |
+
Detailed training pipeline showing how data flows through training scripts and supporting infrastructure.
|
350 |
+
</div>
|
351 |
+
<div class="mermaid">
|
352 |
+
graph TB
|
353 |
+
subgraph "Data Sources"
|
354 |
+
JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
|
355 |
+
GRANARY["NVIDIA Granary Dataset<br/>Multilingual ASR Data"]
|
356 |
+
HFDATA["HF Hub Datasets<br/>Community Datasets"]
|
357 |
+
end
|
358 |
+
|
359 |
+
subgraph "Data Processing"
|
360 |
+
LOADER["Dataset Loader<br/>_load_jsonl_dataset()"]
|
361 |
+
CASTER["Audio Casting<br/>16kHz resampling"]
|
362 |
+
COLLATOR["VoxtralDataCollator<br/>Audio + Text Processing"]
|
363 |
+
end
|
364 |
+
|
365 |
+
subgraph "Training Scripts"
|
366 |
+
TRAIN_FULL["Full Fine-tuning<br/>scripts/train.py"]
|
367 |
+
TRAIN_LORA["LoRA Fine-tuning<br/>scripts/train_lora.py"]
|
368 |
+
|
369 |
+
subgraph "Training Components"
|
370 |
+
MODEL_INIT["Model Initialization<br/>VoxtralForConditionalGeneration"]
|
371 |
+
LORA_CONFIG["LoRA Configuration<br/>LoraConfig + get_peft_model"]
|
372 |
+
PROCESSOR_INIT["Processor Initialization<br/>VoxtralProcessor"]
|
373 |
+
end
|
374 |
+
end
|
375 |
+
|
376 |
+
subgraph "Training Infrastructure"
|
377 |
+
TRACKIO_INIT["Trackio Integration<br/>Experiment Tracking"]
|
378 |
+
HF_TRAINER["Hugging Face Trainer<br/>TrainingArguments + Trainer"]
|
379 |
+
TORCH_DEVICE["Torch Device Setup<br/>GPU/CPU Detection"]
|
380 |
+
end
|
381 |
+
|
382 |
+
subgraph "Training Process"
|
383 |
+
FORWARD_PASS["Forward Pass<br/>Audio Processing + Generation"]
|
384 |
+
LOSS_CALC["Loss Calculation<br/>Masked Language Modeling"]
|
385 |
+
BACKWARD_PASS["Backward Pass<br/>Gradient Computation"]
|
386 |
+
OPTIMIZER_STEP["Optimizer Step<br/>Parameter Updates"]
|
387 |
+
LOGGING["Metrics Logging<br/>Loss, Perplexity, etc."]
|
388 |
+
end
|
389 |
+
|
390 |
+
subgraph "Model Management"
|
391 |
+
CHECKPOINT_SAVING["Checkpoint Saving<br/>Model snapshots"]
|
392 |
+
MODEL_SAVING["Final Model Saving<br/>Processor + Model"]
|
393 |
+
LOCAL_STORAGE["Local Storage<br/>outputs/ directory"]
|
394 |
+
end
|
395 |
+
|
396 |
+
LOADER --> CASTER
|
397 |
+
CASTER --> COLLATOR
|
398 |
+
|
399 |
+
COLLATOR --> TRAIN_FULL
|
400 |
+
COLLATOR --> TRAIN_LORA
|
401 |
+
|
402 |
+
TRAIN_FULL --> MODEL_INIT
|
403 |
+
TRAIN_LORA --> MODEL_INIT
|
404 |
+
TRAIN_LORA --> LORA_CONFIG
|
405 |
+
|
406 |
+
MODEL_INIT --> PROCESSOR_INIT
|
407 |
+
LORA_CONFIG --> PROCESSOR_INIT
|
408 |
+
|
409 |
+
PROCESSOR_INIT --> TRACKIO_INIT
|
410 |
+
PROCESSOR_INIT --> HF_TRAINER
|
411 |
+
PROCESSOR_INIT --> TORCH_DEVICE
|
412 |
+
|
413 |
+
TRACKIO_INIT --> HF_TRAINER
|
414 |
+
TORCH_DEVICE --> HF_TRAINER
|
415 |
+
|
416 |
+
HF_TRAINER --> FORWARD_PASS
|
417 |
+
FORWARD_PASS --> LOSS_CALC
|
418 |
+
LOSS_CALC --> BACKWARD_PASS
|
419 |
+
BACKWARD_PASS --> OPTIMIZER_STEP
|
420 |
+
OPTIMIZER_STEP --> LOGGING
|
421 |
+
|
422 |
+
LOGGING --> CHECKPOINT_SAVING
|
423 |
+
LOGGING --> TRACKIO_INIT
|
424 |
+
|
425 |
+
HF_TRAINER --> MODEL_SAVING
|
426 |
+
MODEL_SAVING --> LOCAL_STORAGE
|
427 |
+
|
428 |
+
JSONL --> LOADER
|
429 |
+
GRANARY --> LOADER
|
430 |
+
HFDATA --> LOADER
|
431 |
+
|
432 |
+
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
|
433 |
+
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
|
434 |
+
classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
|
435 |
+
classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
|
436 |
+
classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
|
437 |
+
classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px
|
438 |
+
|
439 |
+
class JSONL,GRANARY,HFDATA input
|
440 |
+
class LOADER,CASTER,COLLATOR processing
|
441 |
+
class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
|
442 |
+
class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
|
443 |
+
class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
|
444 |
+
class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
|
445 |
+
</div>
|
446 |
+
</div>
|
447 |
+
|
448 |
+
<div id="deployment" class="diagram-container">
|
449 |
+
<div class="diagram-title">Deployment Pipeline</div>
|
450 |
+
<div class="diagram-description">
|
451 |
+
Model publishing and demo deployment process from trained model to live interactive demo.
|
452 |
+
</div>
|
453 |
+
<div class="mermaid">
|
454 |
+
graph TB
|
455 |
+
subgraph "Inputs"
|
456 |
+
TRAINED_MODEL["Trained Model<br/>Local directory"]
|
457 |
+
TRAINING_CONFIG["Training Config<br/>JSON/YAML"]
|
458 |
+
TRAINING_RESULTS["Training Results<br/>Metrics & logs"]
|
459 |
+
MODEL_METADATA["Model Metadata<br/>Name, description, etc."]
|
460 |
+
end
|
461 |
+
|
462 |
+
subgraph "Model Publishing"
|
463 |
+
PUSH_SCRIPT["push_to_huggingface.py<br/>Model Publisher"]
|
464 |
+
|
465 |
+
subgraph "Publishing Steps"
|
466 |
+
REPO_CREATION["Repository Creation<br/>HF Hub API"]
|
467 |
+
FILE_UPLOAD["File Upload<br/>Model files to HF"]
|
468 |
+
METADATA_UPLOAD["Metadata Upload<br/>Config & results"]
|
469 |
+
end
|
470 |
+
end
|
471 |
+
|
472 |
+
subgraph "Model Card Generation"
|
473 |
+
CARD_SCRIPT["generate_model_card.py<br/>Card Generator"]
|
474 |
+
|
475 |
+
subgraph "Card Components"
|
476 |
+
TEMPLATE_LOAD["Template Loading<br/>model_card.md"]
|
477 |
+
VARIABLE_REPLACEMENT["Variable Replacement<br/>Config injection"]
|
478 |
+
CONDITIONAL_PROCESSING["Conditional Sections<br/>Quantized models, etc."]
|
479 |
+
end
|
480 |
+
end
|
481 |
+
|
482 |
+
subgraph "Demo Space Deployment"
|
483 |
+
DEPLOY_SCRIPT["deploy_demo_space.py<br/>Space Deployer"]
|
484 |
+
|
485 |
+
subgraph "Space Setup"
|
486 |
+
SPACE_CREATION["Space Repository<br/>Create HF Space"]
|
487 |
+
TEMPLATE_COPY["Template Copying<br/>demo_voxtral/ files"]
|
488 |
+
ENV_INJECTION["Environment Setup<br/>Model config injection"]
|
489 |
+
SECRET_SETUP["Secret Configuration<br/>HF_TOKEN, model vars"]
|
490 |
+
end
|
491 |
+
end
|
492 |
+
|
493 |
+
subgraph "Space Building"
|
494 |
+
BUILD_TRIGGER[Build Trigger<br/>Automatic build start]
|
495 |
+
DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
|
496 |
+
MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
|
497 |
+
APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
|
498 |
+
end
|
499 |
+
|
500 |
+
subgraph "Live Demo Space"
|
501 |
+
GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
|
502 |
+
MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
|
503 |
+
USER_INTERACTION[User Interaction<br/>Audio upload/playback]
|
504 |
+
end
|
505 |
+
|
506 |
+
subgraph "External Services"
|
507 |
+
HF_HUB[Hugging Face Hub<br/>Model & Space hosting]
|
508 |
+
HF_SPACES[HF Spaces Platform<br/>Demo hosting]
|
509 |
+
end
|
510 |
+
|
511 |
+
TRAINED_MODEL --> PUSH_SCRIPT
|
512 |
+
TRAINING_CONFIG --> PUSH_SCRIPT
|
513 |
+
TRAINING_RESULTS --> PUSH_SCRIPT
|
514 |
+
MODEL_METADATA --> PUSH_SCRIPT
|
515 |
+
|
516 |
+
PUSH_SCRIPT --> REPO_CREATION
|
517 |
+
REPO_CREATION --> FILE_UPLOAD
|
518 |
+
FILE_UPLOAD --> METADATA_UPLOAD
|
519 |
+
|
520 |
+
METADATA_UPLOAD --> CARD_SCRIPT
|
521 |
+
TRAINING_CONFIG --> CARD_SCRIPT
|
522 |
+
TRAINING_RESULTS --> CARD_SCRIPT
|
523 |
+
|
524 |
+
CARD_SCRIPT --> TEMPLATE_LOAD
|
525 |
+
TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
|
526 |
+
VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING
|
527 |
+
|
528 |
+
CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
|
529 |
+
METADATA_UPLOAD --> DEPLOY_SCRIPT
|
530 |
+
|
531 |
+
DEPLOY_SCRIPT --> SPACE_CREATION
|
532 |
+
SPACE_CREATION --> TEMPLATE_COPY
|
533 |
+
TEMPLATE_COPY --> ENV_INJECTION
|
534 |
+
ENV_INJECTION --> SECRET_SETUP
|
535 |
+
|
536 |
+
SECRET_SETUP --> BUILD_TRIGGER
|
537 |
+
BUILD_TRIGGER --> DEPENDENCY_INSTALL
|
538 |
+
DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
|
539 |
+
MODEL_DOWNLOAD --> APP_INITIALIZATION
|
540 |
+
|
541 |
+
APP_INITIALIZATION --> GRADIO_INTERFACE
|
542 |
+
GRADIO_INTERFACE --> MODEL_INFERENCE
|
543 |
+
MODEL_INFERENCE --> USER_INTERACTION
|
544 |
+
|
545 |
+
HF_HUB --> MODEL_DOWNLOAD
|
546 |
+
HF_SPACES --> GRADIO_INTERFACE
|
547 |
+
|
548 |
+
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
|
549 |
+
classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
|
550 |
+
classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
|
551 |
+
classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
|
552 |
+
classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
|
553 |
+
classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
|
554 |
+
classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
|
555 |
+
|
556 |
+
class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
|
557 |
+
class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
|
558 |
+
class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
|
559 |
+
class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
|
560 |
+
class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
|
561 |
+
class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
|
562 |
+
class HF_HUB,HF_SPACES external
|
563 |
+
</div>
|
564 |
+
</div>
|
565 |
+
|
566 |
+
<div id="dataflow" class="diagram-container">
|
567 |
+
<div class="diagram-title">Data Flow</div>
|
568 |
+
<div class="diagram-description">
|
569 |
+
Complete data journey through the Voxtral ASR Fine-tuning application from user input to deployed demo.
|
570 |
+
</div>
|
571 |
+
<div class="mermaid">
|
572 |
+
flowchart TD
|
573 |
+
subgraph "User Input"
|
574 |
+
MIC["Microphone Recording<br/>Raw audio + timestamps"]
|
575 |
+
FILE["File Upload<br/>WAV/FLAC files"]
|
576 |
+
TEXT["Manual Transcripts<br/>Text input"]
|
577 |
+
LANG["Language Selection<br/>25+ languages"]
|
578 |
+
end
|
579 |
+
|
580 |
+
subgraph "Data Processing"
|
581 |
+
AUDIO_PROC["Audio Processing<br/>Resampling to 16kHz<br/>Format conversion"]
|
582 |
+
TEXT_PROC["Text Processing<br/>Transcript validation<br/>Cleaning & formatting"]
|
583 |
+
JSONL_CONV["JSONL Conversion<br/>{'audio_path': '...', 'text': '...'}"]
|
584 |
+
end
|
585 |
+
|
586 |
+
subgraph "Dataset Storage"
|
587 |
+
LOCAL_DS["Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/"]
|
588 |
+
HF_DS["HF Hub Dataset<br/>username/dataset-name<br/>Public sharing"]
|
589 |
+
end
|
590 |
+
|
591 |
+
subgraph "Training Data Pipeline"
|
592 |
+
DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
|
593 |
+
AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
|
594 |
+
TRAIN_SPLIT["Train Split<br/>train_dataset"]
|
595 |
+
EVAL_SPLIT["Eval Split<br/>eval_dataset"]
|
596 |
+
end
|
597 |
+
|
598 |
+
subgraph "Model Training"
|
599 |
+
COLLATOR["VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction"]
|
600 |
+
FORWARD["Forward Pass<br/>Audio β Features β Text"]
|
601 |
+
LOSS["Loss Calculation<br/>Masked LM loss"]
|
602 |
+
BACKWARD["Backward Pass<br/>Gradient computation"]
|
603 |
+
OPTIMIZE["Parameter Updates<br/>LoRA or full fine-tuning"]
|
604 |
+
end
|
605 |
+
|
606 |
+
subgraph "Training Outputs"
|
607 |
+
MODEL_FILES["Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json"]
|
608 |
+
TRAINING_LOGS["Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves"]
|
609 |
+
CHECKPOINTS["Checkpoints<br/>Intermediate models<br/>best model tracking"]
|
610 |
+
end
|
611 |
+
|
612 |
+
subgraph "Publishing Pipeline"
|
613 |
+
HF_REPO["HF Repository<br/>username/model-name<br/>Model hosting"]
|
614 |
+
MODEL_CARD["Model Card<br/>README.md<br/>Training details<br/>Usage examples"]
|
615 |
+
METADATA["Training Metadata<br/>Config + results<br/>Performance metrics"]
|
616 |
+
end
|
617 |
+
|
618 |
+
subgraph "Demo Deployment"
|
619 |
+
SPACE_REPO["HF Space Repository<br/>username/model-name-demo<br/>Demo hosting"]
|
620 |
+
DEMO_APP["Demo Application<br/>Gradio interface<br/>Real-time inference"]
|
621 |
+
ENV_VARS["Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets"]
|
622 |
+
end
|
623 |
+
|
624 |
+
MIC --> AUDIO_PROC
|
625 |
+
FILE --> AUDIO_PROC
|
626 |
+
TEXT --> TEXT_PROC
|
627 |
+
LANG --> TEXT_PROC
|
628 |
+
|
629 |
+
AUDIO_PROC --> JSONL_CONV
|
630 |
+
TEXT_PROC --> JSONL_CONV
|
631 |
+
|
632 |
+
JSONL_CONV --> LOCAL_DS
|
633 |
+
LOCAL_DS --> HF_DS
|
634 |
+
|
635 |
+
LOCAL_DS --> DS_LOADER
|
636 |
+
HF_DS --> DS_LOADER
|
637 |
+
|
638 |
+
DS_LOADER --> AUDIO_CAST
|
639 |
+
AUDIO_CAST --> TRAIN_SPLIT
|
640 |
+
AUDIO_CAST --> EVAL_SPLIT
|
641 |
+
|
642 |
+
TRAIN_SPLIT --> COLLATOR
|
643 |
+
EVAL_SPLIT --> COLLATOR
|
644 |
+
|
645 |
+
COLLATOR --> FORWARD
|
646 |
+
FORWARD --> LOSS
|
647 |
+
LOSS --> BACKWARD
|
648 |
+
BACKWARD --> OPTIMIZE
|
649 |
+
|
650 |
+
OPTIMIZE --> MODEL_FILES
|
651 |
+
OPTIMIZE --> TRAINING_LOGS
|
652 |
+
OPTIMIZE --> CHECKPOINTS
|
653 |
+
|
654 |
+
MODEL_FILES --> HF_REPO
|
655 |
+
TRAINING_LOGS --> HF_REPO
|
656 |
+
CHECKPOINTS --> HF_REPO
|
657 |
+
|
658 |
+
HF_REPO --> MODEL_CARD
|
659 |
+
TRAINING_LOGS --> MODEL_CARD
|
660 |
+
|
661 |
+
MODEL_CARD --> SPACE_REPO
|
662 |
+
HF_REPO --> SPACE_REPO
|
663 |
+
ENV_VARS --> SPACE_REPO
|
664 |
+
|
665 |
+
SPACE_REPO --> DEMO_APP
|
666 |
+
|
667 |
+
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
|
668 |
+
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
|
669 |
+
classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
|
670 |
+
classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
|
671 |
+
classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
|
672 |
+
classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
|
673 |
+
classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
|
674 |
+
|
675 |
+
class MIC,FILE,TEXT,LANG input
|
676 |
+
class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
|
677 |
+
class LOCAL_DS,HF_DS storage
|
678 |
+
class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
|
679 |
+
class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
|
680 |
+
class HF_REPO,MODEL_CARD,METADATA publishing
|
681 |
+
class SPACE_REPO,DEMO_APP,ENV_VARS deployment
|
682 |
+
</div>
|
683 |
+
</div>
|
684 |
+
|
685 |
+
<script>
|
686 |
+
// Toggle mermaid code visibility
|
687 |
+
function toggleCode(diagramId) {
|
688 |
+
const codeBlock = document.querySelector(`#${diagramId} .mermaid-code`);
|
689 |
+
if (codeBlock.style.display === 'none' || codeBlock.style.display === '') {
|
690 |
+
codeBlock.style.display = 'block';
|
691 |
+
} else {
|
692 |
+
codeBlock.style.display = 'none';
|
693 |
+
}
|
694 |
+
}
|
695 |
+
|
696 |
+
// Add toggle buttons to each diagram
|
697 |
+
document.addEventListener('DOMContentLoaded', function() {
|
698 |
+
const diagrams = document.querySelectorAll('.diagram-container');
|
699 |
+
diagrams.forEach((diagram, index) => {
|
700 |
+
const diagramId = diagram.id;
|
701 |
+
const mermaidDiv = diagram.querySelector('.mermaid');
|
702 |
+
|
703 |
+
if (mermaidDiv) {
|
704 |
+
// Create toggle button
|
705 |
+
const toggleBtn = document.createElement('button');
|
706 |
+
toggleBtn.className = 'code-toggle';
|
707 |
+
toggleBtn.textContent = 'π Show Mermaid Code';
|
708 |
+
toggleBtn.onclick = () => toggleCode(diagramId);
|
709 |
+
|
710 |
+
// Create code block
|
711 |
+
const codeBlock = document.createElement('pre');
|
712 |
+
codeBlock.className = 'mermaid-code';
|
713 |
+
codeBlock.textContent = mermaidDiv.textContent.trim();
|
714 |
+
|
715 |
+
// Insert elements
|
716 |
+
mermaidDiv.parentNode.insertBefore(toggleBtn, mermaidDiv);
|
717 |
+
mermaidDiv.parentNode.insertBefore(codeBlock, mermaidDiv.nextSibling);
|
718 |
+
}
|
719 |
+
});
|
720 |
+
});
|
721 |
+
|
722 |
+
// Print functionality
|
723 |
+
function printDiagrams() {
|
724 |
+
window.print();
|
725 |
+
}
|
726 |
+
</script>
|
727 |
+
</body>
|
728 |
+
</html>
|
docs/interface-workflow.md
ADDED
@@ -0,0 +1,173 @@
# Interface Workflow

```mermaid
stateDiagram-v2
    [*] --> LanguageSelection: User opens interface

    state "Language & Dataset Setup" as LangSetup {
        [*] --> LanguageSelection
        LanguageSelection --> LoadPhrases: Select language
        LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
        DisplayPhrases --> RecordingInterface: Show phrases & recording UI

        state RecordingInterface {
            [*] --> ShowInitialRows: Display first 10 phrases
            ShowInitialRows --> RecordAudio: User can record audio
            RecordAudio --> AddMoreRows: Optional - add 10 more rows
            AddMoreRows --> RecordAudio
        }
    }

    RecordingInterface --> DatasetCreation: User finishes recording

    state "Dataset Creation Options" as DatasetCreation {
        [*] --> FromRecordings: Create from recorded audio
        [*] --> FromUploads: Upload existing files

        FromRecordings --> ProcessRecordings: Save WAV files + transcripts
        FromUploads --> ProcessUploads: Process uploaded files + transcripts

        ProcessRecordings --> CreateJSONL: Generate JSONL dataset
        ProcessUploads --> CreateJSONL

        CreateJSONL --> DatasetReady: Dataset saved locally
    }

    DatasetCreation --> TrainingConfiguration: Dataset ready

    state "Training Setup" as TrainingConfiguration {
        [*] --> BasicSettings: Model, LoRA/full, batch size
        [*] --> AdvancedSettings: Learning rate, epochs, LoRA params

        BasicSettings --> ConfigureDeployment: Repo name, push options
        AdvancedSettings --> ConfigureDeployment

        ConfigureDeployment --> StartTraining: All settings configured
    }

    TrainingConfiguration --> TrainingProcess: Start training

    state "Training Process" as TrainingProcess {
        [*] --> InitializeTrackio: Setup experiment tracking
        InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
        RunTrainingScript --> StreamLogs: Show real-time training logs
        StreamLogs --> MonitorProgress: Track metrics & checkpoints

        MonitorProgress --> TrainingComplete: Training finished
        MonitorProgress --> HandleErrors: Training failed
        HandleErrors --> RetryOrExit: User can retry or exit
    }

    TrainingProcess --> PostTraining: Training complete

    state "Post-Training Actions" as PostTraining {
        [*] --> PushToHub: Push model to HF Hub
        [*] --> GenerateModelCard: Create model card
        [*] --> DeployDemoSpace: Deploy interactive demo

        PushToHub --> ModelPublished: Model available on HF Hub
        GenerateModelCard --> ModelDocumented: Model card created
        DeployDemoSpace --> DemoReady: Demo space deployed
    }

    PostTraining --> [*]: Process complete

    %% Alternative paths
    DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
    PushDatasetOnly --> DatasetPublished: Dataset on HF Hub

    %% Error handling
    TrainingProcess --> ErrorRecovery: Handle training errors
    ErrorRecovery --> RetryTraining: Retry with different settings
    RetryTraining --> TrainingConfiguration

    %% Styling and notes
    note right of LanguageSelection : User selects language for<br/>authentic phrases from<br/>NVIDIA Granary dataset
    note right of RecordingInterface : Users record themselves<br/>reading displayed phrases
    note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
    note right of TrainingConfiguration : Configure LoRA parameters,<br/>learning rate, epochs, etc.
    note right of TrainingProcess : Real-time log streaming<br/>with Trackio integration
    note right of PostTraining : Automated deployment<br/>pipeline
```

## Interface Workflow Overview

This diagram illustrates the complete user journey through the Voxtral ASR Fine-tuning interface. The workflow is designed to be intuitive and to guide users through each step of the fine-tuning process.

### Key Workflow Stages

#### 1. Language & Dataset Setup
- **Language Selection**: Users choose from 25+ European languages supported by NVIDIA Granary
- **Phrase Loading**: System loads authentic, high-quality phrases in the selected language (a loading sketch follows this list)
- **Recording Interface**: Dynamic interface showing phrases with audio recording components
- **Progressive Disclosure**: Users can add more rows as needed (up to 100 recordings)

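For orientation, here is a minimal sketch of how phrases could be streamed for the recording UI. The dataset id `nvidia/Granary`, the config name, and the `text` column are assumptions for illustration, not the interface's actual loader:

```python
# Hypothetical sketch: stream a handful of phrases for the recording UI.
# The dataset id, config name, and column name below are assumptions.
from datasets import load_dataset

def load_phrases(language_config: str, n: int = 10) -> list[str]:
    # Streaming avoids downloading the full corpus just to show a few phrases.
    ds = load_dataset("nvidia/Granary", language_config, split="train", streaming=True)
    phrases = []
    for row in ds:
        phrases.append(row["text"])
        if len(phrases) >= n:
            break
    return phrases
```
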
#### 2. Dataset Creation
- **From Recordings**: Process microphone recordings into WAV files and a JSONL dataset
- **From Uploads**: Handle existing WAV/FLAC files with manual transcripts
- **JSONL Format**: Standard format with `audio_path` and `text` fields (see the sketch after this list)
- **Local Storage**: Datasets stored in the `datasets/voxtral_user/` directory

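Because this JSONL layout feeds everything downstream, a minimal sketch of writing it may help; the `data.jsonl` file name and `datasets/voxtral_user/` directory follow the description above, while the helper itself is illustrative rather than the interface's actual code:

```python
# Illustrative helper: write (audio_path, text) pairs as a JSONL dataset.
import json
from pathlib import Path

def write_jsonl(samples: list[dict], root: str = "datasets/voxtral_user") -> Path:
    out = Path(root)
    out.mkdir(parents=True, exist_ok=True)
    jsonl_path = out / "data.jsonl"
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for s in samples:
            # One line per sample: an audio file path and its transcript.
            f.write(json.dumps({"audio_path": s["audio_path"], "text": s["text"]}) + "\n")
    return jsonl_path
```
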
#### 3. Training Configuration
- **Basic Settings**: Model selection, LoRA vs. full fine-tuning, batch size
- **Advanced Settings**: Learning rate, epochs, gradient accumulation
- **LoRA Parameters**: r, alpha, dropout, audio tower freezing options
- **Repository Setup**: Model naming and Hugging Face Hub integration

#### 4. Training Process
- **Trackio Integration**: Automatic experiment tracking setup
- **Script Execution**: Calls the appropriate training script (`train.py` or `train_lora.py`)
- **Log Streaming**: Real-time display of training progress and metrics
- **Error Handling**: Graceful handling of training failures with retry options

#### 5. Post-Training Actions
- **Model Publishing**: Automatic push to the Hugging Face Hub
- **Model Card Generation**: Automated creation using `generate_model_card.py`
- **Demo Deployment**: One-click deployment of interactive demo spaces

### Alternative Paths

#### Dataset-Only Workflow
- Users can create and publish datasets without training models
- Useful for dataset curation and sharing

#### Error Recovery
- Training failures trigger error recovery flows
- Users can retry with modified parameters
- Comprehensive error logging and debugging information

### Technical Integration Points

#### External Services
- **NVIDIA Granary**: Source of high-quality multilingual ASR data
- **Hugging Face Hub**: Model and dataset storage and sharing
- **Trackio Spaces**: Experiment tracking and visualization

#### Script Integration
- **interface.py**: Main Gradio application orchestrating the workflow
- **train.py/train_lora.py**: Core training scripts with Trackio integration
- **push_to_huggingface.py**: Model/dataset publishing
- **deploy_demo_space.py**: Automated demo deployment
- **generate_model_card.py**: Model documentation generation

### User Experience Features

#### Progressive Interface Reveal
- Interface components are revealed as users progress through the workflow
- Reduces cognitive load and guides users step by step

#### Real-time Feedback
- Live log streaming during training
- Progress indicators and status updates
- Immediate feedback on dataset creation and validation

#### Flexible Input Methods
- Support for both live recording and file uploads
- Multiple language options for diverse user needs
- Scalable recording interface (10-100 samples)

See also:
- [Architecture Overview](architecture.md)
- [Training Pipeline](training-pipeline.md)
- [Data Flow](data-flow.md)
docs/training-pipeline.md
ADDED
@@ -0,0 +1,271 @@
# Training Pipeline

```mermaid
graph TB
    %% Input Data Sources
    subgraph "Data Sources"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        GRANARY["NVIDIA Granary Dataset<br/>Multilingual ASR Data"]
        HFDATA["HF Hub Datasets<br/>Community Datasets"]
    end

    %% Data Processing
    subgraph "Data Processing"
        LOADER["Dataset Loader<br/>_load_jsonl_dataset()"]
        CASTER["Audio Casting<br/>16kHz resampling"]
        COLLATOR["VoxtralDataCollator<br/>Audio + Text Processing"]
    end

    %% Training Scripts
    subgraph "Training Scripts"
        TRAIN_FULL["Full Fine-tuning<br/>scripts/train.py"]
        TRAIN_LORA["LoRA Fine-tuning<br/>scripts/train_lora.py"]

        subgraph "Training Components"
            MODEL_INIT["Model Initialization<br/>VoxtralForConditionalGeneration"]
            LORA_CONFIG["LoRA Configuration<br/>LoraConfig + get_peft_model"]
            PROCESSOR_INIT["Processor Initialization<br/>VoxtralProcessor"]
        end
    end

    %% Training Infrastructure
    subgraph "Training Infrastructure"
        TRACKIO_INIT["Trackio Integration<br/>Experiment Tracking"]
        HF_TRAINER["Hugging Face Trainer<br/>TrainingArguments + Trainer"]
        TORCH_DEVICE["Torch Device Setup<br/>GPU/CPU Detection"]
    end

    %% Training Process
    subgraph "Training Process"
        FORWARD_PASS["Forward Pass<br/>Audio Processing + Generation"]
        LOSS_CALC["Loss Calculation<br/>Masked Language Modeling"]
        BACKWARD_PASS["Backward Pass<br/>Gradient Computation"]
        OPTIMIZER_STEP["Optimizer Step<br/>Parameter Updates"]
        LOGGING["Metrics Logging<br/>Loss, Perplexity, etc."]
    end

    %% Model Management
    subgraph "Model Management"
        CHECKPOINT_SAVING["Checkpoint Saving<br/>Model snapshots"]
        MODEL_SAVING["Final Model Saving<br/>Processor + Model"]
        LOCAL_STORAGE["Local Storage<br/>outputs/ directory"]
    end

    %% Flow Connections
    JSONL --> LOADER
    GRANARY --> LOADER
    HFDATA --> LOADER

    LOADER --> CASTER
    CASTER --> COLLATOR

    COLLATOR --> TRAIN_FULL
    COLLATOR --> TRAIN_LORA

    TRAIN_FULL --> MODEL_INIT
    TRAIN_LORA --> MODEL_INIT
    TRAIN_LORA --> LORA_CONFIG

    MODEL_INIT --> PROCESSOR_INIT
    LORA_CONFIG --> PROCESSOR_INIT

    PROCESSOR_INIT --> TRACKIO_INIT
    PROCESSOR_INIT --> HF_TRAINER
    PROCESSOR_INIT --> TORCH_DEVICE

    TRACKIO_INIT --> HF_TRAINER
    TORCH_DEVICE --> HF_TRAINER

    HF_TRAINER --> FORWARD_PASS
    FORWARD_PASS --> LOSS_CALC
    LOSS_CALC --> BACKWARD_PASS
    BACKWARD_PASS --> OPTIMIZER_STEP
    OPTIMIZER_STEP --> LOGGING

    LOGGING --> CHECKPOINT_SAVING
    LOGGING --> TRACKIO_INIT

    HF_TRAINER --> MODEL_SAVING
    MODEL_SAVING --> LOCAL_STORAGE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class JSONL,GRANARY,HFDATA input
    class LOADER,CASTER,COLLATOR processing
    class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
    class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
    class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
    class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
```

## Training Pipeline Overview

This diagram illustrates the complete training pipeline for Voxtral ASR fine-tuning, showing how data flows through the training scripts and supporting infrastructure.

### Data Input Sources

#### JSONL Datasets
- **Local Datasets**: User-created datasets from recordings or uploads
- **Format**: `{"audio_path": "path/to/audio.wav", "text": "transcription"}`
- **Processing**: Loaded via the `_load_jsonl_dataset()` function

#### NVIDIA Granary Dataset
- **Multilingual Support**: 25+ European languages
- **High Quality**: Curated ASR training data
- **Streaming**: Efficient loading without a full download

#### Hugging Face Hub Datasets
- **Community Datasets**: Public datasets from the HF Hub
- **Standard Formats**: Compatible with Voxtral training requirements

### Data Processing Pipeline

#### Dataset Loading
```python
# Load a local JSONL or HF dataset
ds = _load_jsonl_dataset(jsonl_path)
# or
ds = load_dataset(ds_name, ds_cfg, split="test")
```

#### Audio Processing
```python
# Cast to Audio format with 16kHz resampling
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```

#### Data Collation
- **VoxtralDataCollator**: Custom collator for Voxtral training
- **Audio Processing**: Converts audio to model inputs
- **Text Tokenization**: Processes transcription text
- **Masking**: Masks prompt tokens during training (a minimal sketch follows this list)

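The masking step is the heart of the collator: prompt and audio tokens are excluded from the loss so the model is only trained to predict the transcription. A minimal sketch of the idea, with illustrative names rather than the actual `VoxtralDataCollator` code:

```python
# Sketch: mask prompt tokens so only transcription tokens are scored.
import torch

def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100  # -100 is ignored by PyTorch cross-entropy
    return labels
```
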
### Training Script Architecture

#### Full Fine-tuning (`train.py`)
- **Complete Model Updates**: All parameters are trainable
- **Higher Memory Requirements**: Full model in memory
- **Better Convergence**: Can achieve higher accuracy

#### LoRA Fine-tuning (`train_lora.py`)
- **Parameter Efficient**: Only LoRA adapters are trained
- **Lower Memory Usage**: Base model stays frozen
- **Faster Training**: Fewer parameters to update
- **Configurable**: `r`, `alpha`, and `dropout` parameters (see the sketch after this list)

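A minimal sketch of the LoRA setup with `peft`; the model id, parameter values, and target modules here are assumptions for illustration, not necessarily the defaults used by `train_lora.py`:

```python
# Illustrative LoRA configuration; values and target modules are assumed.
from transformers import VoxtralForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_config)  # wraps the frozen base model
model.print_trainable_parameters()          # adapters are a small fraction of the total
```
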
### Training Infrastructure

#### Trackio Integration
```python
trackio.init(
    project="voxtral-finetuning",
    config={...},  # Training parameters
    space_id=trackio_space
)
```

#### Hugging Face Trainer
```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    learning_rate=learning_rate,
    num_train_epochs=epochs,
    bf16=True,  # BFloat16 for efficiency
    report_to=["trackio"],
    # ... other args
)
```

#### Device Management
- **GPU Detection**: Automatic CUDA/GPU detection (see the snippet after this list)
- **Fallback**: CPU training if no GPU is available
- **Memory Optimization**: Model sharding and gradient checkpointing

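A simplified version of the device setup the scripts perform:

```python
# Pick the GPU when available, otherwise fall back to CPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
```
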
### Training Process Flow

#### Forward Pass
1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Audio feature extraction
3. **Text Generation**: Autoregressive text generation from audio features

#### Loss Calculation
- **Masked Language Modeling**: Only transcription tokens contribute to the loss
- **Audio Prompt Masking**: Audio processing tokens are masked out
- **Cross-Entropy Loss**: Standard language modeling loss

#### Backward Pass & Optimization
- **Gradient Computation**: Backpropagation through the model
- **LoRA Updates**: Only adapter parameters updated (LoRA mode)
- **Full Updates**: All parameters updated (full fine-tuning)

### Model Management

#### Checkpoint Saving
- **Regular Checkpoints**: Saved every N steps
- **Best Model Tracking**: Save the best model based on validation loss
- **Resume Capability**: Continue training from checkpoints

#### Final Model Saving
```python
trainer.save_model()  # Saves model and tokenizer
processor.save_pretrained(output_dir)  # Saves processor
```

#### Local Storage Structure
```
outputs/
└── voxtral-finetuned-{timestamp}/
    ├── config.json
    ├── model.safetensors
    ├── tokenizer.json
    ├── training_config.json
    ├── train_results.json
    └── eval_results.json
```

### Integration Points

#### With Interface (`interface.py`)
- **Parameter Passing**: Training parameters from the UI
- **Log Streaming**: Real-time training logs to the UI
- **Progress Monitoring**: Training progress updates

#### With Model Publishing (`push_to_huggingface.py`)
- **Model Upload**: Trained model to the HF Hub
- **Metadata**: Training config and results
- **Model Cards**: Automatic model card generation

#### With Demo Deployment (`deploy_demo_space.py`)
- **Space Creation**: HF Spaces for demos
- **Model Integration**: Deploy the trained model in the demo
- **Configuration**: Demo-specific settings

### Performance Considerations

#### Memory Optimization
- **LoRA**: Significantly reduces memory requirements
- **Gradient Checkpointing**: Trades compute for memory
- **Mixed Precision**: BF16/FP16 training

#### Training Efficiency
- **Batch Size**: Balanced with gradient accumulation
- **Learning Rate**: Warmup and decay schedules
- **Early Stopping**: Prevents overfitting

#### Monitoring & Debugging
- **Metrics Tracking**: Loss, perplexity, learning rate
- **GPU Utilization**: Memory and compute monitoring
- **Error Handling**: Graceful failure recovery

See also:
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Data Flow](data-flow.md)
scripts/generate_svgs.py
ADDED
@@ -0,0 +1,135 @@
#!/usr/bin/env python3
"""
Generate SVG versions of Mermaid diagrams for documentation
"""

import re
import requests
from pathlib import Path
from typing import Optional

class MermaidToSVGConverter:
    """Convert Mermaid diagrams to SVG format"""

    def __init__(self):
        self.mermaid_api_url = "https://mermaid.ink/img/"

    def extract_mermaid_code(self, markdown_file: Path) -> Optional[str]:
        """Extract Mermaid code from a Markdown file"""
        try:
            with open(markdown_file, 'r', encoding='utf-8') as f:
                content = f.read()

            # Find Mermaid code blocks
            mermaid_pattern = r'```mermaid\s*\n(.*?)\n```'
            match = re.search(mermaid_pattern, content, re.DOTALL)

            if match:
                return match.group(1).strip()
            else:
                print(f"No Mermaid diagram found in {markdown_file}")
                return None

        except Exception as e:
            print(f"Error reading {markdown_file}: {e}")
            return None

    def convert_to_svg(self, mermaid_code: str, output_path: Path) -> bool:
        """Convert Mermaid code to SVG using the mermaid.ink service"""
        try:
            # Encode the Mermaid code for the URL
            import base64
            import urllib.parse

            # Prepend a theme init directive in the format expected by mermaid.ink
            mermaid_data = f"%%{{init: {{'theme': 'base', 'themeVariables': {{'primaryColor': '#e3f2fd', 'primaryTextColor': '#1976d2', 'primaryBorderColor': '#01579b', 'lineColor': '#424242', 'secondaryColor': '#fff3e0', 'tertiaryColor': '#fce4ec'}}}}}}%%\n{mermaid_code}"

            # Base64-encode the mermaid code
            encoded = base64.b64encode(mermaid_data.encode('utf-8')).decode('utf-8')
            url_encoded = urllib.parse.quote(encoded)

            # Create the full URL
            full_url = f"{self.mermaid_api_url}{url_encoded}"

            # Make the request
            response = requests.get(full_url, timeout=30)

            if response.status_code == 200:
                # Save the SVG
                with open(output_path, 'wb') as f:
                    f.write(response.content)
                print(f"✅ Generated SVG: {output_path}")
                return True
            else:
                print(f"❌ Failed to generate SVG for {output_path}: HTTP {response.status_code}")
                return False

        except Exception as e:
            print(f"❌ Error generating SVG for {output_path}: {e}")
            return False

    def process_markdown_file(self, markdown_file: Path, output_dir: Path) -> bool:
        """Process a single Markdown file and generate its SVG"""
        # Extract Mermaid code
        mermaid_code = self.extract_mermaid_code(markdown_file)
        if not mermaid_code:
            return False

        # Create output filename
        svg_filename = markdown_file.stem + ".svg"
        output_path = output_dir / svg_filename

        # Convert to SVG
        return self.convert_to_svg(mermaid_code, output_path)

def main():
    """Main function to generate SVGs for all documentation files"""
    print("Generating SVG versions of documentation diagrams...")

    # Setup paths
    docs_dir = Path(__file__).parent.parent / "docs"
    svgs_dir = docs_dir / "svgs"

    # Create SVGs directory
    svgs_dir.mkdir(exist_ok=True)

    # Initialize converter
    converter = MermaidToSVGConverter()

    # Process all Markdown files in the docs directory
    markdown_files = [
        "README.md",
        "architecture.md",
        "interface-workflow.md",
        "training-pipeline.md",
        "deployment-pipeline.md",
        "data-flow.md"
    ]

    success_count = 0
    total_count = len(markdown_files)

    for filename in markdown_files:
        markdown_path = docs_dir / filename
        if markdown_path.exists():
            print(f"\nProcessing {filename}...")
            if converter.process_markdown_file(markdown_path, svgs_dir):
                success_count += 1
        else:
            print(f"⚠️ File not found: {markdown_path}")

    print(f"\nSVG generation complete!")
    print(f"✅ Successfully generated: {success_count}/{total_count} SVGs")
    print(f"SVGs saved to: {svgs_dir}")

    if success_count < total_count:
        print(f"❌ Failed to generate: {total_count - success_count} SVGs")
        return 1

    return 0

if __name__ == "__main__":
    exit(main())
scripts/validate_mermaid.py
ADDED
@@ -0,0 +1,73 @@
#!/usr/bin/env python3
"""
Validate Mermaid syntax in HTML documentation
"""

import re

def validate_mermaid_html(html_file):
    """Validate Mermaid diagrams in an HTML file"""
    print(f"Validating Mermaid syntax in {html_file}")

    with open(html_file, 'r', encoding='utf-8') as f:
        content = f.read()

    # Find all Mermaid blocks
    mermaid_pattern = r'<div class="mermaid">(.*?)</div>'
    mermaid_blocks = re.findall(mermaid_pattern, content, re.DOTALL)

    print(f"Found {len(mermaid_blocks)} Mermaid blocks")

    issues = []

    # Check each Mermaid block
    for i, block in enumerate(mermaid_blocks):
        lines = block.strip().split('\n')
        if not lines or not lines[0].strip():
            issues.append(f"Block {i+1}: Empty Mermaid block")
            continue

        first_line = lines[0].strip()

        # Check if it starts with a valid diagram type
        valid_starts = [
            'graph', 'flowchart', 'stateDiagram', 'sequenceDiagram',
            'classDiagram', 'erDiagram', 'journey', 'gantt', 'pie',
            'gitgraph', 'mindmap', 'timeline', 'sankey'
        ]

        if not any(first_line.startswith(start) for start in valid_starts):
            issues.append(f"Block {i+1}: Invalid diagram type start - '{first_line}'")

        # Check for classDef/class consistency
        if 'classDef' in block:
            class_statements = len(re.findall(r'^\s*class\s+', block, re.MULTILINE))
            if class_statements == 0:
                issues.append(f"Block {i+1}: classDef defined but no class statements found")

        # Check for basic syntax issues
        if block.count('[') != block.count(']'):
            issues.append(f"Block {i+1}: Unmatched square brackets")

        if block.count('(') != block.count(')'):
            issues.append(f"Block {i+1}: Unmatched parentheses")

        if 'subgraph' in block:
            subgraph_count = block.count('subgraph')
            # Count only standalone `end` keywords so that words containing
            # "end" inside node labels (e.g. "Dependency") are not miscounted.
            end_count = len(re.findall(r'^\s*end\s*$', block, re.MULTILINE))
            if subgraph_count != end_count:
                issues.append(f"Block {i+1}: Unmatched subgraph/end blocks ({subgraph_count} vs {end_count})")

    # Report results
    print("\nValidation Results:")
    if issues:
        print("❌ Issues found:")
        for issue in issues:
            print(f" - {issue}")
        return False
    else:
        print("✅ No syntax issues found!")
        return True

if __name__ == "__main__":
    validate_mermaid_html("docs/diagrams.html")