Joseph Pollack committed on
Commit
a595d5a
·
unverified ·
1 Parent(s): a3a3978

improves demo for automatic deployment and links the interface to the deployment scripts

docs/blog-accessibility.md ADDED
@@ -0,0 +1,497 @@
1
+ # Accessible Speech Recognition: Fine‑tune Voxtral on Your Own Voice
2
+
3
+ Building speech technology that understands everyone is an accessibility imperative. If you have a speech impediment (e.g., stutter, dysarthria, apraxia) or a heavy accent, mainstream ASR systems can struggle. This app lets you fine‑tune the Voxtral ASR model on your own voice so it adapts to your unique speaking style — improving recognition accuracy and unlocking more inclusive voice experiences.
4
+
5
+ ## Who this helps
6
+
7
+ - **People with speech differences**: Personalized models that reduce error rates on your voice
8
+ - **Accented speakers**: Adapt Voxtral to your accent and vocabulary
9
+ - **Educators/clinicians**: Create tailored recognition models for communication support
10
+ - **Product teams**: Prototype inclusive voice features with real users quickly
11
+
12
+ ## What you get
13
+
14
+ - **Record or upload audio** and create a JSONL dataset in a few clicks
15
+ - **One‑click training** with full fine‑tuning or LoRA for efficiency
16
+ - **Automatic publishing** to Hugging Face Hub with a generated model card
17
+ - **Instant demo deployment** to HF Spaces for shareable, live ASR
18
+
19
+ ## How it works (at a glance)
20
+
21
+ ```mermaid
22
+ graph TD
23
+ %% Main Entry Point
24
+ START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}
25
+
26
+ %% Documentation Categories
27
+ OVERVIEW --> ARCH[🏗️ Architecture Overview]
28
+ OVERVIEW --> WORKFLOW[🔄 Interface Workflow]
29
+ OVERVIEW --> TRAINING[🚀 Training Pipeline]
30
+ OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
31
+ OVERVIEW --> DATAFLOW[📊 Data Flow]
32
+
33
+ %% Architecture Section
34
+ ARCH --> ARCH_DIAG[High-level Architecture<br/>System Components & Layers]
35
+ ARCH --> ARCH_LINK["📄 View Details → architecture.md"]
36
+
37
+ %% Interface Section
38
+ WORKFLOW --> WORKFLOW_DIAG[User Journey<br/>Recording → Training → Demo]
39
+ WORKFLOW --> WORKFLOW_LINK["📄 View Details → interface-workflow.md"]
40
+
41
+ %% Training Section
42
+ TRAINING --> TRAINING_DIAG[Training Scripts<br/>Data → Model → Results]
43
+ TRAINING --> TRAINING_LINK["📄 View Details → training-pipeline.md"]
44
+
45
+ %% Deployment Section
46
+ DEPLOYMENT --> DEPLOYMENT_DIAG[Publishing & Demo<br/>Model → Hub → Space]
47
+ DEPLOYMENT --> DEPLOYMENT_LINK["📄 View Details → deployment-pipeline.md"]
48
+
49
+ %% Data Flow Section
50
+ DATAFLOW --> DATAFLOW_DIAG[Complete Data Journey<br/>Input → Processing → Output]
51
+ DATAFLOW --> DATAFLOW_LINK["📄 View Details → data-flow.md"]
52
+
53
+ %% Key Components Highlight
54
+ subgraph "🎛️ Core Components"
55
+ INTERFACE[interface.py<br/>Gradio Web UI]
56
+ TRAIN_SCRIPTS[scripts/train*.py<br/>Training Scripts]
57
+ DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
58
+ PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
59
+ end
60
+
61
+ %% Data Flow Highlight
62
+ subgraph "📁 Key Data Formats"
63
+ JSONL[JSONL Dataset<br/>{"audio_path": "...", "text": "..."}]
64
+ HFDATA[HF Hub Models<br/>username/model-name]
65
+ SPACES[HF Spaces<br/>Interactive Demos]
66
+ end
67
+
68
+ %% Connect components to their respective docs
69
+ INTERFACE --> WORKFLOW
70
+ TRAIN_SCRIPTS --> TRAINING
71
+ DEPLOY_SCRIPT --> DEPLOYMENT
72
+ PUSH_SCRIPT --> DEPLOYMENT
73
+
74
+ JSONL --> DATAFLOW
75
+ HFDATA --> DEPLOYMENT
76
+ SPACES --> DEPLOYMENT
77
+
78
+ %% Styling
79
+ classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
80
+ classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
81
+ classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
82
+ classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
83
+ classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
84
+ classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
85
+
86
+ class START entry
87
+ class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
88
+ class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
89
+ class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
90
+ class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
91
+ class JSONL,HFDATA,SPACES data
92
+ ```
93
+
94
+ See the interactive diagram page for printing and quick navigation: [Interactive diagrams](diagrams.html).
95
+
96
+ ## Quick start
97
+
98
+ ### 1) Install
99
+
100
+ ```bash
101
+ git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
102
+ cd Finetune-Voxtral-ASR
103
+ ```
104
+
105
+ Use UV (recommended) or pip.
106
+
107
+ ```bash
108
+ # UV
109
+ uv venv .venv --python 3.10 && source .venv/bin/activate
110
+ uv pip install -r requirements.txt
111
+
112
+ # or pip
113
+ python -m venv .venv --python 3.10 && source .venv/bin/activate
114
+ pip install --upgrade pip
115
+ pip install -r requirements.txt
116
+ ```
117
+
118
+ ### 2) Launch the interface
119
+
120
+ ```bash
121
+ python interface.py
122
+ ```
123
+
124
+ The Gradio app guides you through language selection, recording or uploading audio, dataset creation, and training.
125
+
126
+ ## Create your voice dataset (UI)
127
+
128
+ ```mermaid
129
+ stateDiagram-v2
130
+ [*] --> LanguageSelection: User opens interface
131
+
132
+ state "Language & Dataset Setup" as LangSetup {
133
+ [*] --> LanguageSelection
134
+ LanguageSelection --> LoadPhrases: Select language
135
+ LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
136
+ DisplayPhrases --> RecordingInterface: Show phrases & recording UI
137
+
138
+ state RecordingInterface {
139
+ [*] --> ShowInitialRows: Display first 10 phrases
140
+ ShowInitialRows --> RecordAudio: User can record audio
141
+ RecordAudio --> AddMoreRows: Optional - add 10 more rows
142
+ AddMoreRows --> RecordAudio
143
+ }
144
+ }
145
+
146
+ RecordingInterface --> DatasetCreation: User finishes recording
147
+
148
+ state "Dataset Creation Options" as DatasetCreation {
149
+ [*] --> FromRecordings: Create from recorded audio
150
+ [*] --> FromUploads: Upload existing files
151
+
152
+ FromRecordings --> ProcessRecordings: Save WAV files + transcripts
153
+ FromUploads --> ProcessUploads: Process uploaded files + transcripts
154
+
155
+ ProcessRecordings --> CreateJSONL: Generate JSONL dataset
156
+ ProcessUploads --> CreateJSONL
157
+
158
+ CreateJSONL --> DatasetReady: Dataset saved locally
159
+ }
160
+
161
+ DatasetCreation --> TrainingConfiguration: Dataset ready
162
+
163
+ state "Training Setup" as TrainingConfiguration {
164
+ [*] --> BasicSettings: Model, LoRA/full, batch size
165
+ [*] --> AdvancedSettings: Learning rate, epochs, LoRA params
166
+
167
+ BasicSettings --> ConfigureDeployment: Repo name, push options
168
+ AdvancedSettings --> ConfigureDeployment
169
+
170
+ ConfigureDeployment --> StartTraining: All settings configured
171
+ }
172
+
173
+ TrainingConfiguration --> TrainingProcess: Start training
174
+
175
+ state "Training Process" as TrainingProcess {
176
+ [*] --> InitializeTrackio: Setup experiment tracking
177
+ InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
178
+ RunTrainingScript --> StreamLogs: Show real-time training logs
179
+ StreamLogs --> MonitorProgress: Track metrics & checkpoints
180
+
181
+ MonitorProgress --> TrainingComplete: Training finished
182
+ MonitorProgress --> HandleErrors: Training failed
183
+ HandleErrors --> RetryOrExit: User can retry or exit
184
+ }
185
+
186
+ TrainingProcess --> PostTraining: Training complete
187
+
188
+ state "Post-Training Actions" as PostTraining {
189
+ [*] --> PushToHub: Push model to HF Hub
190
+ [*] --> GenerateModelCard: Create model card
191
+ [*] --> DeployDemoSpace: Deploy interactive demo
192
+
193
+ PushToHub --> ModelPublished: Model available on HF Hub
194
+ GenerateModelCard --> ModelDocumented: Model card created
195
+ DeployDemoSpace --> DemoReady: Demo space deployed
196
+ }
197
+
198
+ PostTraining --> [*]: Process complete
199
+
200
+ %% Alternative paths
201
+ DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
202
+ PushDatasetOnly --> DatasetPublished: Dataset on HF Hub
203
+
204
+ %% Error handling
205
+ TrainingProcess --> ErrorRecovery: Handle training errors
206
+ ErrorRecovery --> RetryTraining: Retry with different settings
207
+ RetryTraining --> TrainingConfiguration
208
+
209
+ %% Styling and notes
210
+ note right of LanguageSelection : User selects language for\n authentic phrases from\n NVIDIA Granary dataset
211
+ note right of RecordingInterface : Users record themselves\n reading displayed phrases
212
+ note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
213
+ note right of TrainingConfiguration : Configure LoRA parameters,\n learning rate, epochs, etc.
214
+ note right of TrainingProcess : Real-time log streaming\n with Trackio integration
215
+ note right of PostTraining : Automated deployment\n pipeline
216
+ ```
217
+
218
+ Steps you’ll follow in the UI:
219
+
220
+ - **Choose language**: Select a language for authentic phrases (from NVIDIA Granary)
221
+ - **Record or upload**: Capture your voice or provide existing audio + transcripts
222
+ - **Create dataset**: The app writes a JSONL file with entries like `{ "audio_path": ..., "text": ... }` (see the example after this list)
223
+ - **Configure training**: Pick base model, LoRA vs full, batch size and learning rate
224
+ - **Run training**: Watch live logs and metrics; resume on error if needed
225
+ - **Publish & deploy**: Push to HF Hub and one‑click deploy an interactive Space
226
+
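+ A minimal example of the resulting JSONL (one JSON object per line; the file names are placeholders for the clips the app saves):
+
+ ```json
+ {"audio_path": "datasets/my_voice/rec_0001.wav", "text": "The quick brown fox jumps over the lazy dog."}
+ {"audio_path": "datasets/my_voice/rec_0002.wav", "text": "Please schedule my appointment for Tuesday morning."}
+ ```
+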
227
+ ## Train your personalized Voxtral model
228
+
229
+ Under the hood, training uses Hugging Face Trainer and a custom `VoxtralDataCollator` that builds Voxtral/LLaMA‑style prompts and masks the prompt tokens so loss is computed only on the transcription.
230
+
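+ To make the masking concrete, here is a conceptual sketch (this is not the project's collator — that lives in the training scripts): labels start as a copy of the input ids, and every prompt/audio position is set to `-100` so the cross‑entropy loss ignores it.
+
+ ```python
+ import torch
+
+ def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
+     """Toy stand-in for the collator's label masking."""
+     labels = input_ids.clone()
+     labels[:, :prompt_len] = -100  # -100 is ignored by the loss
+     return labels
+
+ # Toy batch: 4 prompt/audio tokens followed by 3 transcription tokens
+ batch = torch.tensor([[101, 7, 8, 9, 1500, 1501, 1502]])
+ print(mask_prompt_tokens(batch, prompt_len=4))
+ # tensor([[-100, -100, -100, -100, 1500, 1501, 1502]])
+ ```
+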
231
+ ```mermaid
232
+ graph TB
233
+ %% Input Data Sources
234
+ subgraph "Data Sources"
235
+ JSONL[JSONL Dataset<br/>{"audio_path": "...", "text": "..."}]
236
+ GRANARY[NVIDIA Granary Dataset<br/>Multilingual ASR Data]
237
+ HFDATA[HF Hub Datasets<br/>Community Datasets]
238
+ end
239
+
240
+ %% Data Processing
241
+ subgraph "Data Processing"
242
+ LOADER[Dataset Loader<br/>_load_jsonl_dataset()]
243
+ CASTER[Audio Casting<br/>16kHz resampling]
244
+ COLLATOR[VoxtralDataCollator<br/>Audio + Text Processing]
245
+ end
246
+
247
+ %% Training Scripts
248
+ subgraph "Training Scripts"
249
+ TRAIN_FULL[Full Fine-tuning<br/>scripts/train.py]
250
+ TRAIN_LORA[LoRA Fine-tuning<br/>scripts/train_lora.py]
251
+
252
+ subgraph "Training Components"
253
+ MODEL_INIT[Model Initialization<br/>VoxtralForConditionalGeneration]
254
+ LORA_CONFIG[LoRA Configuration<br/>LoraConfig + get_peft_model]
255
+ PROCESSOR_INIT[Processor Initialization<br/>VoxtralProcessor]
256
+ end
257
+ end
258
+
259
+ %% Training Infrastructure
260
+ subgraph "Training Infrastructure"
261
+ TRACKIO_INIT[Trackio Integration<br/>Experiment Tracking]
262
+ HF_TRAINER[Hugging Face Trainer<br/>TrainingArguments + Trainer]
263
+ TORCH_DEVICE[Torch Device Setup<br/>GPU/CPU Detection]
264
+ end
265
+
266
+ %% Training Process
267
+ subgraph "Training Process"
268
+ FORWARD_PASS[Forward Pass<br/>Audio Processing + Generation]
269
+ LOSS_CALC[Loss Calculation<br/>Masked Language Modeling]
270
+ BACKWARD_PASS[Backward Pass<br/>Gradient Computation]
271
+ OPTIMIZER_STEP[Optimizer Step<br/>Parameter Updates]
272
+ LOGGING[Metrics Logging<br/>Loss, Perplexity, etc.]
273
+ end
274
+
275
+ %% Model Management
276
+ subgraph "Model Management"
277
+ CHECKPOINT_SAVING[Checkpoint Saving<br/>Model snapshots]
278
+ MODEL_SAVING[Final Model Saving<br/>Processor + Model]
279
+ LOCAL_STORAGE[Local Storage<br/>outputs/ directory]
280
+ end
281
+
282
+ %% Flow Connections
283
+ JSONL --> LOADER
284
+ GRANARY --> LOADER
285
+ HFDATA --> LOADER
286
+
287
+ LOADER --> CASTER
288
+ CASTER --> COLLATOR
289
+
290
+ COLLATOR --> TRAIN_FULL
291
+ COLLATOR --> TRAIN_LORA
292
+
293
+ TRAIN_FULL --> MODEL_INIT
294
+ TRAIN_LORA --> MODEL_INIT
295
+ TRAIN_LORA --> LORA_CONFIG
296
+
297
+ MODEL_INIT --> PROCESSOR_INIT
298
+ LORA_CONFIG --> PROCESSOR_INIT
299
+
300
+ PROCESSOR_INIT --> TRACKIO_INIT
301
+ PROCESSOR_INIT --> HF_TRAINER
302
+ PROCESSOR_INIT --> TORCH_DEVICE
303
+
304
+ TRACKIO_INIT --> HF_TRAINER
305
+ TORCH_DEVICE --> HF_TRAINER
306
+
307
+ HF_TRAINER --> FORWARD_PASS
308
+ FORWARD_PASS --> LOSS_CALC
309
+ LOSS_CALC --> BACKWARD_PASS
310
+ BACKWARD_PASS --> OPTIMIZER_STEP
311
+ OPTIMIZER_STEP --> LOGGING
312
+
313
+ LOGGING --> CHECKPOINT_SAVING
314
+ LOGGING --> TRACKIO_INIT
315
+
316
+ HF_TRAINER --> MODEL_SAVING
317
+ MODEL_SAVING --> LOCAL_STORAGE
318
+
319
+ %% Styling
320
+ classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
321
+ classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
322
+ classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
323
+ classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
324
+ classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
325
+ classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px
326
+
327
+ class JSONL,GRANARY,HFDATA input
328
+ class LOADER,CASTER,COLLATOR processing
329
+ class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
330
+ class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
331
+ class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
332
+ class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
333
+ ```
334
+
335
+ CLI alternatives (if you prefer the terminal):
336
+
337
+ ```bash
338
+ # Full fine-tuning
339
+ uv run train.py
340
+
341
+ # Parameter‑efficient LoRA fine‑tuning (recommended for most users)
342
+ uv run train_lora.py
343
+ ```
344
+
345
+ ## Publish and deploy a live demo
346
+
347
+ After training, the app can push your model and metrics to the Hugging Face Hub and create an interactive Space demo automatically.
348
+
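+ Both steps are also scriptable from the terminal. A hedged example of deploying the demo Space (the flags below come from `scripts/deploy_demo_space.py`; the model id is a placeholder for your own pushed repo):
+
+ ```bash
+ # Push the model first (via the UI or scripts/push_to_huggingface.py), then deploy the Space
+ python scripts/deploy_demo_space.py \
+     --model-id your-username/voxtral-finetune-20250101_120000 \
+     --demo-type voxtral
+ ```
+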
349
+ ```mermaid
350
+ graph TB
351
+ %% Input Sources
352
+ subgraph "Inputs"
353
+ TRAINED_MODEL[Trained Model<br/>Local directory]
354
+ TRAINING_CONFIG[Training Config<br/>JSON/YAML]
355
+ TRAINING_RESULTS[Training Results<br/>Metrics & logs]
356
+ MODEL_METADATA[Model Metadata<br/>Name, description, etc.]
357
+ end
358
+
359
+ %% Model Publishing
360
+ subgraph "Model Publishing"
361
+ PUSH_SCRIPT[push_to_huggingface.py<br/>Model Publisher]
362
+
363
+ subgraph "Publishing Steps"
364
+ REPO_CREATION[Repository Creation<br/>HF Hub API]
365
+ FILE_UPLOAD[File Upload<br/>Model files to HF]
366
+ METADATA_UPLOAD[Metadata Upload<br/>Config & results]
367
+ end
368
+ end
369
+
370
+ %% Model Card Generation
371
+ subgraph "Model Card Generation"
372
+ CARD_SCRIPT[generate_model_card.py<br/>Card Generator]
373
+
374
+ subgraph "Card Components"
375
+ TEMPLATE_LOAD[Template Loading<br/>model_card.md]
376
+ VARIABLE_REPLACEMENT[Variable Replacement<br/>Config injection]
377
+ CONDITIONAL_PROCESSING[Conditional Sections<br/>Quantized models, etc.]
378
+ end
379
+ end
380
+
381
+ %% Demo Space Deployment
382
+ subgraph "Demo Space Deployment"
383
+ DEPLOY_SCRIPT[deploy_demo_space.py<br/>Space Deployer]
384
+
385
+ subgraph "Space Setup"
386
+ SPACE_CREATION[Space Repository<br/>Create HF Space]
387
+ TEMPLATE_COPY[Template Copying<br/>demo_voxtral/ files]
388
+ ENV_INJECTION[Environment Setup<br/>Model config injection]
389
+ SECRET_SETUP[Secret Configuration<br/>HF_TOKEN, model vars]
390
+ end
391
+ end
392
+
393
+ %% Space Building & Testing
394
+ subgraph "Space Building"
395
+ BUILD_TRIGGER[Build Trigger<br/>Automatic build start]
396
+ DEPENDENCY_INSTALL[Dependency Installation<br/>requirements.txt]
397
+ MODEL_DOWNLOAD[Model Download<br/>From HF Hub]
398
+ APP_INITIALIZATION[App Initialization<br/>Gradio app setup]
399
+ end
400
+
401
+ %% Live Demo
402
+ subgraph "Live Demo Space"
403
+ GRADIO_INTERFACE[Gradio Interface<br/>Interactive demo]
404
+ MODEL_INFERENCE[Model Inference<br/>Real-time ASR]
405
+ USER_INTERACTION[User Interaction<br/>Audio upload/playback]
406
+ end
407
+
408
+ %% External Services
409
+ subgraph "External Services"
410
+ HF_HUB[Hugging Face Hub<br/>Model & Space hosting]
411
+ HF_SPACES[HF Spaces Platform<br/>Demo hosting]
412
+ end
413
+
414
+ %% Flow Connections
415
+ TRAINED_MODEL --> PUSH_SCRIPT
416
+ TRAINING_CONFIG --> PUSH_SCRIPT
417
+ TRAINING_RESULTS --> PUSH_SCRIPT
418
+ MODEL_METADATA --> PUSH_SCRIPT
419
+
420
+ PUSH_SCRIPT --> REPO_CREATION
421
+ REPO_CREATION --> FILE_UPLOAD
422
+ FILE_UPLOAD --> METADATA_UPLOAD
423
+
424
+ METADATA_UPLOAD --> CARD_SCRIPT
425
+ TRAINING_CONFIG --> CARD_SCRIPT
426
+ TRAINING_RESULTS --> CARD_SCRIPT
427
+
428
+ CARD_SCRIPT --> TEMPLATE_LOAD
429
+ TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
430
+ VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING
431
+
432
+ CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
433
+ METADATA_UPLOAD --> DEPLOY_SCRIPT
434
+
435
+ DEPLOY_SCRIPT --> SPACE_CREATION
436
+ SPACE_CREATION --> TEMPLATE_COPY
437
+ TEMPLATE_COPY --> ENV_INJECTION
438
+ ENV_INJECTION --> SECRET_SETUP
439
+
440
+ SECRET_SETUP --> BUILD_TRIGGER
441
+ BUILD_TRIGGER --> DEPENDENCY_INSTALL
442
+ DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
443
+ MODEL_DOWNLOAD --> APP_INITIALIZATION
444
+
445
+ APP_INITIALIZATION --> GRADIO_INTERFACE
446
+ GRADIO_INTERFACE --> MODEL_INFERENCE
447
+ MODEL_INFERENCE --> USER_INTERACTION
448
+
449
+ HF_HUB --> MODEL_DOWNLOAD
450
+ HF_SPACES --> GRADIO_INTERFACE
451
+
452
+ %% Styling
453
+ classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
454
+ classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
455
+ classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
456
+ classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
457
+ classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
458
+ classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
459
+ classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
460
+
461
+ class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
462
+ class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
463
+ class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
464
+ class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
465
+ class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
466
+ class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
467
+ class HF_HUB,HF_SPACES external
468
+ ```
469
+
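+ Once the model is on the Hub, the deployed Space loads it and transcribes audio roughly as follows — a condensed sketch of `templates/spaces/demo_voxtral/app.py`, with the repo id as a placeholder for your own model:
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, VoxtralForConditionalGeneration
+
+ repo_id = "your-username/voxtral-finetune-20250101_120000"  # placeholder
+ processor = AutoProcessor.from_pretrained(repo_id)
+ model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.float32)
+
+ # Build a transcription request from an audio file (method name spelled as in current transformers)
+ inputs = processor.apply_transcrition_request(language="en", audio="sample.wav", model_id=repo_id)
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=256)
+ # Decode only the newly generated tokens
+ text = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
+ print(text)
+ ```
+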
470
+ ## Why personalization improves accessibility
471
+
472
+ - **Your model learns your patterns**: tempo, prosody, phoneme realizations, disfluencies
473
+ - **Vocabulary and names**: teach domain terms and proper nouns you use often
474
+ - **Bias correction**: reduce systematic errors common to off‑the‑shelf ASR for your voice
475
+ - **Agency and privacy**: keep data local and only publish when you choose
476
+
477
+ ## Practical tips
478
+
479
+ - **Start with LoRA**: Parameter‑efficient fine‑tuning is faster and uses less memory (see the sketch after this list)
480
+ - **Record diverse samples**: Different tempos, environments, and phrase lengths
481
+ - **Short sessions**: Many shorter clips beat a few long ones for learning
482
+ - **Check transcripts**: Clean, accurate transcripts improve outcomes
483
+
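+ For context, a minimal sketch of what the LoRA option configures with PEFT — the rank, alpha, and target module names here are illustrative assumptions; `scripts/train_lora.py` defines the values the app actually uses:
+
+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import VoxtralForConditionalGeneration
+
+ model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
+ lora_config = LoraConfig(
+     r=16,                                 # adapter rank (assumed)
+     lora_alpha=32,                        # scaling factor (assumed)
+     target_modules=["q_proj", "v_proj"],  # assumed attention projections
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora_config)
+ model.print_trainable_parameters()  # only the small adapter matrices are trainable
+ ```
+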
484
+ ## Learn more
485
+
486
+ - [Repository README](../README.md)
487
+ - [Documentation Overview](README.md)
488
+ - [Architecture Overview](architecture.md)
489
+ - [Interface Workflow](interface-workflow.md)
490
+ - [Training Pipeline](training-pipeline.md)
491
+ - [Deployment Pipeline](deployment-pipeline.md)
492
+ - [Data Flow](data-flow.md)
493
+ - [Interactive Diagrams](diagrams.html)
494
+
495
+ ---
496
+
497
+ This project exists to make voice technology work better for everyone. If you build a model that helps you — or your community — consider sharing a demo so others can learn from it.
interface.py CHANGED
@@ -745,15 +745,15 @@ with gr.Blocks(title="Voxtral ASR Fine-tuning") as demo:
745
  def _collect_upload(files, txt):
746
  lines = [s.strip() for s in (txt or "").splitlines() if s.strip()]
747
  jsonl_path = _save_uploaded_dataset(files or [], lines)
748
- return f"✅ Dataset saved locally: {jsonl_path}"
749
 
750
- def _push_dataset_handler(repo_name):
751
- if not jsonl_path_state.value:
752
  return "❌ No dataset saved yet. Please save dataset first."
753
- return _push_dataset_to_hub(jsonl_path_state.value, repo_name)
754
 
755
- save_upload_btn.click(_collect_upload, [upload_audio, transcripts_box], [jsonl_path_state])
756
- push_dataset_btn.click(_push_dataset_handler, [dataset_repo_name], [jsonl_path_state])
757
 
758
  # Save recordings button
759
  save_rec_btn = gr.Button("Save recordings as dataset", visible=False)
@@ -782,16 +782,16 @@ with gr.Blocks(title="Voxtral ASR Fine-tuning") as demo:
782
  rows.append({"audio_path": str(out_path), "text": label_text})
783
  jsonl_path = dataset_dir / "data.jsonl"
784
  _write_jsonl(rows, jsonl_path)
785
- return str(jsonl_path)
786
 
787
- save_rec_btn.click(_collect_preloaded_recs, rec_components + [phrase_texts_state], [jsonl_path_state])
788
 
789
- def _push_recordings_handler(repo_name):
790
- if not jsonl_path_state.value:
791
  return "❌ No recordings dataset saved yet. Please save recordings first."
792
- return _push_dataset_to_hub(jsonl_path_state.value, repo_name)
793
 
794
- push_recordings_btn.click(_push_recordings_handler, [dataset_repo_name], [jsonl_path_state])
795
 
796
  # Removed multilingual dataset sample section - phrases are now loaded automatically when language is selected
797
 
 
745
  def _collect_upload(files, txt):
746
  lines = [s.strip() for s in (txt or "").splitlines() if s.strip()]
747
  jsonl_path = _save_uploaded_dataset(files or [], lines)
748
+ return str(jsonl_path), f"✅ Dataset saved locally: {jsonl_path}"
749
 
750
+ def _push_dataset_handler(repo_name, current_jsonl_path):
751
+ if not current_jsonl_path:
752
  return "❌ No dataset saved yet. Please save dataset first."
753
+ return _push_dataset_to_hub(current_jsonl_path, repo_name)
754
 
755
+ save_upload_btn.click(_collect_upload, [upload_audio, transcripts_box], [jsonl_path_state, dataset_status])
756
+ push_dataset_btn.click(_push_dataset_handler, [dataset_repo_name, jsonl_path_state], [dataset_status])
757
 
758
  # Save recordings button
759
  save_rec_btn = gr.Button("Save recordings as dataset", visible=False)
 
782
  rows.append({"audio_path": str(out_path), "text": label_text})
783
  jsonl_path = dataset_dir / "data.jsonl"
784
  _write_jsonl(rows, jsonl_path)
785
+ return str(jsonl_path), f"✅ Dataset saved locally: {jsonl_path}"
786
 
787
+ save_rec_btn.click(_collect_preloaded_recs, rec_components + [phrase_texts_state], [jsonl_path_state, dataset_status])
788
 
789
+ def _push_recordings_handler(repo_name, current_jsonl_path):
790
+ if not current_jsonl_path:
791
  return "❌ No recordings dataset saved yet. Please save recordings first."
792
+ return _push_dataset_to_hub(current_jsonl_path, repo_name)
793
 
794
+ push_recordings_btn.click(_push_recordings_handler, [dataset_repo_name, jsonl_path_state], [dataset_status])
795
 
796
  # Removed multilingual dataset sample section - phrases are now loaded automatically when language is selected
797
 
scripts/deploy_demo_space.py CHANGED
@@ -25,11 +25,9 @@ except ImportError:
25
  HF_HUB_AVAILABLE = False
26
  print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")
27
 
28
- # Add src to path for imports
29
  sys.path.append(str(Path(__file__).parent.parent / "src"))
30
 
31
- from config import SmolLM3Config
32
-
33
  # Setup logging
34
  logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
35
  logger = logging.getLogger(__name__)
@@ -223,14 +221,9 @@ os.environ['BRAND_PROJECT_URL'] = {_json.dumps(self.brand_project_url)}
223
 
224
  """
225
  elif self.demo_type == "voxtral":
226
- import json as _json
227
- env_setup = f"""
228
- # Environment variables for Voxtral ASR demo
229
- import os
230
- os.environ['HF_MODEL_ID'] = {_json.dumps(self.model_id)}
231
- os.environ['MODEL_NAME'] = {_json.dumps(self.model_id.split('/')[-1])}
232
- os.environ['HF_USERNAME'] = {_json.dumps(self.hf_username)}
233
- """
234
  else:
235
  # For SmolLM models, use simpler setup
236
  import json as _json
@@ -534,80 +527,80 @@ os.environ['BRAND_PROJECT_URL'] = {_json.dumps(self.brand_project_url)}
534
  copied_files.append(file_path.name)
535
  logger.info(f"✅ Copied {file_path.name} to temp directory")
536
 
537
- # Update app.py with environment variables
538
  app_file = Path(temp_dir) / "app.py"
539
- if app_file.exists():
540
  with open(app_file, 'r', encoding='utf-8') as f:
541
  content = f.read()
542
-
543
- # Add environment variable setup at the top
544
  env_setup = self._generate_env_setup()
545
-
546
- # Insert after imports
547
- lines = content.split('\n')
548
- import_end = 0
549
- for i, line in enumerate(lines):
550
- if line.startswith('import ') or line.startswith('from '):
551
- import_end = i + 1
552
- elif line.strip() == '' and import_end > 0:
553
- break
554
-
555
- lines.insert(import_end, env_setup)
556
- content = '\n'.join(lines)
557
-
558
- with open(app_file, 'w', encoding='utf-8') as f:
559
- f.write(content)
560
-
561
- logger.info("✅ Updated app.py with model configuration")
562
-
563
- # YAML front matter required by Hugging Face Spaces
564
- yaml_front_matter = (
565
- f"---\n"
566
- f"title: {'GPT-OSS Demo' if self.demo_type == 'gpt' else 'SmolLM3 Demo'}\n"
567
- f"emoji: {'🌟' if self.demo_type == 'gpt' else '💃🏻'}\n"
568
- f"colorFrom: {'blue' if self.demo_type == 'gpt' else 'green'}\n"
569
- f"colorTo: {'pink' if self.demo_type == 'gpt' else 'purple'}\n"
570
- f"sdk: gradio\n"
571
- f"sdk_version: 5.40.0\n"
572
- f"app_file: app.py\n"
573
- f"pinned: false\n"
574
- f"short_description: Interactive demo for {self.model_id}\n"
575
- + ("license: mit\n" if self.demo_type != 'gpt' else "") +
576
- f"---\n\n"
577
- )
578
 
579
- # Create README.md for the space (include configuration details)
580
- readme_content = (
581
- yaml_front_matter
582
- + f"# Demo: {self.model_id}\n\n"
583
- + f"This is an interactive demo for the fine-tuned model {self.model_id}.\n\n"
584
- + "## Features\n"
585
- "- Interactive chat interface\n"
586
- "- Customizable system & developer prompts\n"
587
- "- Advanced generation parameters\n"
588
- "- Thinking mode support\n\n"
589
- + "## Model Information\n"
590
- f"- **Model ID**: {self.model_id}\n"
591
- f"- **Subfolder**: {self.subfolder if self.subfolder and self.subfolder.strip() else 'main'}\n"
592
- f"- **Deployed by**: {self.hf_username}\n"
593
- + ("- **Base Model**: openai/gpt-oss-20b\n" if self.demo_type == 'gpt' else "")
594
- + "\n"
595
- + "## Configuration\n"
596
- "- **Model Identity**:\n\n"
597
- f"```\n{self.model_identity or 'Not set'}\n```\n\n"
598
- "- **System Message** (default):\n\n"
599
- f"```\n{(self.system_message or self.model_identity) or 'Not set'}\n```\n\n"
600
- "- **Developer Message** (default):\n\n"
601
- f"```\n{self.developer_message or 'Not set'}\n```\n\n"
602
- "These defaults come from the selected training configuration and can be adjusted in the UI when you run the demo.\n\n"
603
- + "## Usage\n"
604
- "Simply start chatting with the model using the interface below!\n\n"
605
- + "---\n"
606
- "*This demo was automatically deployed by the SmolFactory Fine-tuning Pipeline*\n"
607
- )
608
-
609
- with open(Path(temp_dir) / "README.md", 'w', encoding='utf-8') as f:
610
- f.write(readme_content)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
611
 
612
  logger.info(f"✅ Prepared {len(copied_files)} files in temporary directory")
613
  return temp_dir
@@ -874,7 +867,7 @@ def main():
874
  parser.add_argument("--model-id", required=True, help="Model ID to deploy demo for")
875
  parser.add_argument("--subfolder", default="int4", help="Model subfolder (default: int4)")
876
  parser.add_argument("--space-name", help="Custom space name (optional)")
877
- parser.add_argument("--demo-type", choices=["smol", "gpt"], help="Demo type: 'smol' for SmolLM, 'gpt' for GPT-OSS (auto-detected if not specified)")
878
  parser.add_argument("--config-file", help="Path to the training config file to import context (system/developer/model_identity)")
879
  # Examples configuration
880
  parser.add_argument("--examples-type", choices=["general", "medical"], help="Examples pack to enable in the demo UI")
 
25
  HF_HUB_AVAILABLE = False
26
  print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")
27
 
28
+ # Add src to path for imports (kept for potential future imports)
29
  sys.path.append(str(Path(__file__).parent.parent / "src"))
30
 
 
 
31
  # Setup logging
32
  logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
33
  logger = logging.getLogger(__name__)
 
221
 
222
  """
223
  elif self.demo_type == "voxtral":
224
+ # For Voxtral, we do not inject env setup into app.py.
225
+ # Space variables are set via the API in set_space_secrets().
226
+ env_setup = ""
 
 
 
 
 
227
  else:
228
  # For SmolLM models, use simpler setup
229
  import json as _json
 
527
  copied_files.append(file_path.name)
528
  logger.info(f"✅ Copied {file_path.name} to temp directory")
529
 
530
+ # Update app.py with environment variables (skip for Voxtral)
531
  app_file = Path(temp_dir) / "app.py"
532
+ if app_file.exists() and self.demo_type != "voxtral":
533
  with open(app_file, 'r', encoding='utf-8') as f:
534
  content = f.read()
535
+
 
536
  env_setup = self._generate_env_setup()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
537
 
538
+ if env_setup:
539
+ # Insert after imports
540
+ lines = content.split('\n')
541
+ import_end = 0
542
+ for i, line in enumerate(lines):
543
+ if line.startswith('import ') or line.startswith('from '):
544
+ import_end = i + 1
545
+ elif line.strip() == '' and import_end > 0:
546
+ break
547
+
548
+ lines.insert(import_end, env_setup)
549
+ content = '\n'.join(lines)
550
+
551
+ with open(app_file, 'w', encoding='utf-8') as f:
552
+ f.write(content)
553
+
554
+ logger.info(" Updated app.py with model configuration")
555
+
556
+ # For Voxtral keep the template README. For others, create a README with YAML front matter.
557
+ if self.demo_type != "voxtral":
558
+ yaml_front_matter = (
559
+ f"---\n"
560
+ f"title: {'GPT-OSS Demo' if self.demo_type == 'gpt' else 'SmolLM3 Demo'}\n"
561
+ f"emoji: {'🌟' if self.demo_type == 'gpt' else '💃🏻'}\n"
562
+ f"colorFrom: {'blue' if self.demo_type == 'gpt' else 'green'}\n"
563
+ f"colorTo: {'pink' if self.demo_type == 'gpt' else 'purple'}\n"
564
+ f"sdk: gradio\n"
565
+ f"sdk_version: 5.40.0\n"
566
+ f"app_file: app.py\n"
567
+ f"pinned: false\n"
568
+ f"short_description: Interactive demo for {self.model_id}\n"
569
+ + ("license: mit\n" if self.demo_type != 'gpt' else "") +
570
+ f"---\n\n"
571
+ )
572
+
573
+ readme_content = (
574
+ yaml_front_matter
575
+ + f"# Demo: {self.model_id}\n\n"
576
+ + f"This is an interactive demo for the fine-tuned model {self.model_id}.\n\n"
577
+ + "## Features\n"
578
+ "- Interactive chat interface\n"
579
+ "- Customizable system & developer prompts\n"
580
+ "- Advanced generation parameters\n"
581
+ "- Thinking mode support\n\n"
582
+ + "## Model Information\n"
583
+ f"- **Model ID**: {self.model_id}\n"
584
+ f"- **Subfolder**: {self.subfolder if self.subfolder and self.subfolder.strip() else 'main'}\n"
585
+ f"- **Deployed by**: {self.hf_username}\n"
586
+ + ("- **Base Model**: openai/gpt-oss-20b\n" if self.demo_type == 'gpt' else "")
587
+ + "\n"
588
+ + "## Configuration\n"
589
+ "- **Model Identity**:\n\n"
590
+ f"```\n{self.model_identity or 'Not set'}\n```\n\n"
591
+ "- **System Message** (default):\n\n"
592
+ f"```\n{(self.system_message or self.model_identity) or 'Not set'}\n```\n\n"
593
+ "- **Developer Message** (default):\n\n"
594
+ f"```\n{self.developer_message or 'Not set'}\n```\n\n"
595
+ "These defaults come from the selected training configuration and can be adjusted in the UI when you run the demo.\n\n"
596
+ + "## Usage\n"
597
+ "Simply start chatting with the model using the interface below!\n\n"
598
+ + "---\n"
599
+ "*This demo was automatically deployed by the SmolFactory Fine-tuning Pipeline*\n"
600
+ )
601
+
602
+ with open(Path(temp_dir) / "README.md", 'w', encoding='utf-8') as f:
603
+ f.write(readme_content)
604
 
605
  logger.info(f"✅ Prepared {len(copied_files)} files in temporary directory")
606
  return temp_dir
 
867
  parser.add_argument("--model-id", required=True, help="Model ID to deploy demo for")
868
  parser.add_argument("--subfolder", default="int4", help="Model subfolder (default: int4)")
869
  parser.add_argument("--space-name", help="Custom space name (optional)")
870
+ parser.add_argument("--demo-type", choices=["smol", "gpt", "voxtral"], help="Demo type: 'smol' for SmolLM, 'gpt' for GPT-OSS, 'voxtral' for Voxtral ASR (auto-detected if not specified)")
871
  parser.add_argument("--config-file", help="Path to the training config file to import context (system/developer/model_identity)")
872
  # Examples configuration
873
  parser.add_argument("--examples-type", choices=["general", "medical"], help="Examples pack to enable in the demo UI")
scripts/push_to_huggingface.py CHANGED
@@ -69,6 +69,8 @@ class HuggingFacePusher:
69
 
70
  # Resolve the full repo id (username/repo) if user only provided repo name
71
  self.repo_id = self._resolve_repo_id(self.repo_name)
 
 
72
 
73
  logger.info(f"Initialized HuggingFacePusher for {self.repo_id}")
74
 
@@ -133,37 +135,57 @@ class HuggingFacePusher:
133
  logger.error(f"❌ Failed to create repository: {e}")
134
  return False
135
 
136
- def validate_model_path(self) -> bool:
137
- """Validate that the model path contains required files"""
138
- # Support both safetensors and pytorch formats
139
- required_files = [
140
- "config.json",
141
- "tokenizer.json",
142
- "tokenizer_config.json"
143
  ]
144
-
145
- # Check for model files (either safetensors or pytorch)
146
- model_files = [
147
- "model.safetensors.index.json", # Safetensors format
148
- "pytorch_model.bin" # PyTorch format
 
 
 
 
149
  ]
150
-
151
- missing_files = []
152
- for file in required_files:
153
- if not (self.model_path / file).exists():
154
- missing_files.append(file)
155
-
156
- # Check if at least one model file exists
157
- model_file_exists = any((self.model_path / file).exists() for file in model_files)
158
- if not model_file_exists:
159
- missing_files.extend(model_files)
160
-
161
- if missing_files:
162
- logger.error(f"❌ Missing required files: {missing_files}")
163
- return False
164
-
165
- logger.info(" Model files validated")
166
- return True
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
167
 
168
  def create_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
169
  """Create a comprehensive model card using the generate_model_card.py script"""
@@ -215,88 +237,48 @@ class HuggingFacePusher:
215
  return self._create_simple_model_card(training_config, results)
216
 
217
  def _create_simple_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
218
- """Create a simple model card without complex YAML to avoid formatting issues"""
219
- return f"""---
220
- language:
221
- - en
222
- - fr
223
- license: apache-2.0
224
- tags:
225
- - smollm3
226
- - fine-tuned
227
- - causal-lm
228
- - text-generation
229
- pipeline_tag: text-generation
230
- base_model: HuggingFaceTB/SmolLM3-3B
231
- ---
232
-
233
- # {self.repo_id.split('/')[-1]}
234
-
235
- This is a fine-tuned SmolLM3 model based on the HuggingFaceTB/SmolLM3-3B architecture.
236
-
237
- ## Model Details
238
-
239
- - **Base Model**: HuggingFaceTB/SmolLM3-3B
240
- - **Fine-tuning Method**: Supervised Fine-tuning
241
- - **Training Date**: {datetime.now().strftime('%Y-%m-%d')}
242
- - **Model Size**: {self._get_model_size():.1f} GB
243
- - **Dataset Repository**: {self.dataset_repo}
244
- - **Hardware**: {self._get_hardware_info()}
245
-
246
- ## Training Configuration
247
-
248
- ```json
249
- {json.dumps(training_config, indent=2)}
250
- ```
251
-
252
- ## Training Results
253
-
254
- ```json
255
- {json.dumps(results, indent=2)}
256
- ```
257
-
258
- ## Usage
259
-
260
- ```python
261
- from transformers import AutoModelForCausalLM, AutoTokenizer
262
-
263
- # Load model and tokenizer
264
- model = AutoModelForCausalLM.from_pretrained("{self.repo_id}")
265
- tokenizer = AutoTokenizer.from_pretrained("{self.repo_id}")
266
-
267
- # Generate text
268
- inputs = tokenizer("Hello, how are you?", return_tensors="pt")
269
- outputs = model.generate(**inputs, max_new_tokens=100)
270
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
271
- ```
272
-
273
- ## Training Information
274
-
275
- - **Base Model**: HuggingFaceTB/SmolLM3-3B
276
- - **Hardware**: {self._get_hardware_info()}
277
- - **Training Time**: {results.get('training_time_hours', 'Unknown')} hours
278
- - **Final Loss**: {results.get('final_loss', 'Unknown')}
279
- - **Final Accuracy**: {results.get('final_accuracy', 'Unknown')}
280
- - **Dataset Repository**: {self.dataset_repo}
281
-
282
- ## Model Performance
283
-
284
- - **Training Loss**: {results.get('train_loss', 'Unknown')}
285
- - **Validation Loss**: {results.get('eval_loss', 'Unknown')}
286
- - **Training Steps**: {results.get('total_steps', 'Unknown')}
287
-
288
- ## Experiment Tracking
289
-
290
- This model was trained with experiment tracking enabled. Training metrics and configuration are stored in the HF Dataset repository: `{self.dataset_repo}`
291
-
292
- ## Limitations and Biases
293
-
294
- This model is fine-tuned for specific tasks and may not generalize well to all use cases. Please evaluate the model's performance on your specific task before deployment.
295
-
296
- ## License
297
-
298
- This model is licensed under the Apache 2.0 License.
299
- """
300
 
301
  def _get_model_size(self) -> float:
302
  """Get model size in GB"""
 
69
 
70
  # Resolve the full repo id (username/repo) if user only provided repo name
71
  self.repo_id = self._resolve_repo_id(self.repo_name)
72
+ # Artifact type detection (full vs lora)
73
+ self.artifact_type: Optional[str] = None
74
 
75
  logger.info(f"Initialized HuggingFacePusher for {self.repo_id}")
76
 
 
135
  logger.error(f"❌ Failed to create repository: {e}")
136
  return False
137
 
138
+ def _detect_artifact_type(self) -> str:
139
+ """Detect whether output dir contains a full model or a LoRA adapter."""
140
+ # LoRA artifacts
141
+ lora_candidates = [
142
+ self.model_path / "adapter_config.json",
143
+ self.model_path / "adapter_model.safetensors",
144
+ self.model_path / "adapter_model.bin",
145
  ]
146
+ if any(p.exists() for p in lora_candidates) and (self.model_path / "adapter_config.json").exists():
147
+ return "lora"
148
+
149
+ # Full model artifacts
150
+ full_candidates = [
151
+ self.model_path / "config.json",
152
+ self.model_path / "model.safetensors",
153
+ self.model_path / "model.safetensors.index.json",
154
+ self.model_path / "pytorch_model.bin",
155
  ]
156
+ if any(p.exists() for p in full_candidates):
157
+ return "full"
158
+
159
+ return "unknown"
160
+
161
+ def validate_model_path(self) -> bool:
162
+ """Validate that the model path contains required files for Voxtral full or LoRA."""
163
+ self.artifact_type = self._detect_artifact_type()
164
+ if self.artifact_type == "lora":
165
+ required = [self.model_path / "adapter_config.json"]
166
+ if not all(p.exists() for p in required):
167
+ logger.error("❌ LoRA artifacts missing required files (adapter_config.json)")
168
+ return False
169
+ # At least one adapter weight
170
+ if not ((self.model_path / "adapter_model.safetensors").exists() or (self.model_path / "adapter_model.bin").exists()):
171
+ logger.error(" LoRA artifacts missing adapter weights (adapter_model.safetensors or adapter_model.bin)")
172
+ return False
173
+ logger.info("✅ Detected LoRA adapter artifacts")
174
+ return True
175
+
176
+ if self.artifact_type == "full":
177
+ # Relaxed set: require config.json and at least one model weights file
178
+ if not (self.model_path / "config.json").exists():
179
+ logger.error("❌ Missing config.json in model directory")
180
+ return False
181
+ if not ((self.model_path / "model.safetensors").exists() or (self.model_path / "model.safetensors.index.json").exists() or (self.model_path / "pytorch_model.bin").exists()):
182
+ logger.error("❌ Missing model weights file (model.safetensors or pytorch_model.bin)")
183
+ return False
184
+ logger.info("✅ Detected full model artifacts")
185
+ return True
186
+
187
+ logger.error("❌ Could not detect model artifacts (neither full model nor LoRA)")
188
+ return False
189
 
190
  def create_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
191
  """Create a comprehensive model card using the generate_model_card.py script"""
 
237
  return self._create_simple_model_card(training_config, results)
238
 
239
  def _create_simple_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
240
+ """Create a simple model card tailored for Voxtral ASR (supports full and LoRA)."""
241
+ tags = ["voxtral", "asr", "speech-to-text", "fine-tuning"]
242
+ if self.artifact_type == "lora":
243
+ tags.append("lora")
244
+ front_matter = {
245
+ "license": "apache-2.0",
246
+ "tags": tags,
247
+ "pipeline_tag": "automatic-speech-recognition",
248
+ }
249
+ fm_yaml = "---\n" + "\n".join([
250
+ "license: apache-2.0",
251
+ "tags:",
252
+ ]) + "\n" + "\n".join([f"- {t}" for t in tags]) + "\n" + "pipeline_tag: automatic-speech-recognition\n---\n\n"
253
+ model_title = self.repo_id.split('/')[-1]
254
+ body = [
255
+ f"# {model_title}",
256
+ "",
257
+ ("This repository contains a LoRA adapter for Voxtral ASR. "
258
+ "Merge the adapter with the base model or load via PEFT for inference." if self.artifact_type == "lora" else
259
+ "This repository contains a fine-tuned Voxtral ASR model."),
260
+ "",
261
+ "## Usage",
262
+ "",
263
+ ("```python\nfrom transformers import AutoProcessor\nfrom peft import PeftModel\nfrom transformers import AutoModelForSeq2SeqLM\n\nbase_model_id = 'mistralai/Voxtral-Mini-3B-2507'\nprocessor = AutoProcessor.from_pretrained(base_model_id)\nbase_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)\nmodel = PeftModel.from_pretrained(base_model, '{self.repo_id}')\n```" if self.artifact_type == "lora" else
264
+ f"""```python
265
+ from transformers import AutoProcessor, AutoModelForSeq2SeqLM
266
+
267
+ processor = AutoProcessor.from_pretrained("{self.repo_id}")
268
+ model = AutoModelForSeq2SeqLM.from_pretrained("{self.repo_id}")
269
+ ```"""),
270
+ "",
271
+ "## Training Configuration",
272
+ "",
273
+ f"```json\n{json.dumps(training_config or {}, indent=2)}\n```",
274
+ "",
275
+ "## Training Results",
276
+ "",
277
+ f"```json\n{json.dumps(results or {}, indent=2)}\n```",
278
+ "",
279
+ f"**Hardware**: {self._get_hardware_info()}",
280
+ ]
281
+ return fm_yaml + "\n".join(body)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
282
 
283
  def _get_model_size(self) -> float:
284
  """Get model size in GB"""
templates/model_card.md CHANGED
@@ -5,12 +5,10 @@ language:
5
  license: apache-2.0
6
  library_name: transformers
7
  tags:
8
- - smollm3
9
  - fine-tuned
10
- - causal-lm
11
  - text-generation
12
  - tonic
13
- - legml
14
  {{#if quantized_models}}- quantized{{/if}}
15
  pipeline_tag: text-generation
16
  base_model: {{base_model}}
 
5
  license: apache-2.0
6
  library_name: transformers
7
  tags:
8
+ - voxtral
9
  - fine-tuned
 
10
  - text-generation
11
  - tonic
 
12
  {{#if quantized_models}}- quantized{{/if}}
13
  pipeline_tag: text-generation
14
  base_model: {{base_model}}
templates/spaces/demo_voxtral/README.md CHANGED
@@ -12,12 +12,24 @@ short_description: Interactive ASR demo for a fine-tuned Voxtral model
12
  This Space serves a Voxtral ASR model for speech-to-text transcription.
13
  Usage:
14
 
15
- - Click Record and read the displayed phrase aloud.
16
- - Stop recording to see the transcription.
17
- - Works best with ~16 kHz audio; internal processing follows Voxtral's processor expectations.
 
18
 
19
  Environment variables expected:
20
 
21
  - `HF_MODEL_ID`: The model repo to load (e.g., `username/voxtral-finetune-YYYYMMDD_HHMMSS`)
22
  - `MODEL_NAME`: Display name
23
  - `HF_USERNAME`: For branding
 
 
 
 
 
 
 
 
 
 
 
 
12
  This Space serves a Voxtral ASR model for speech-to-text transcription.
13
  Usage:
14
 
15
+ - Select a language (or leave on Auto for detection).
16
+ - Upload an audio file or record via microphone.
17
+ - Click Transcribe to see the transcription.
18
+ - Works best with standard speech audio; Voxtral handles language detection by default.
19
 
20
  Environment variables expected:
21
 
22
  - `HF_MODEL_ID`: The model repo to load (e.g., `username/voxtral-finetune-YYYYMMDD_HHMMSS`)
23
  - `MODEL_NAME`: Display name
24
  - `HF_USERNAME`: For branding
25
+ - `MODEL_SUBFOLDER`: Optional subfolder in the repo (e.g., `int4`) for quantized/packed weights
26
+
27
+ Supported languages:
28
+
29
+ - English, French, German, Spanish, Italian, Portuguese, Dutch, Hindi
30
+ - Or choose Auto to let the model detect the language
31
+
32
+ Notes:
33
+
34
+ - Uses bfloat16 on GPU and float32 on CPU.
35
+ - Decodes only newly generated tokens for clean transcriptions.
templates/spaces/demo_voxtral/app.py CHANGED
@@ -1,33 +1,100 @@
1
  import os
2
  import gradio as gr
3
  import torch
4
- from transformers import AutoProcessor, AutoModelForSeq2SeqLM
 
 
 
 
 
5
 
6
  HF_MODEL_ID = os.getenv("HF_MODEL_ID", "mistralai/Voxtral-Mini-3B-2507")
7
  MODEL_NAME = os.getenv("MODEL_NAME", HF_MODEL_ID.split("/")[-1])
8
  HF_USERNAME = os.getenv("HF_USERNAME", "")
 
9
 
10
- processor = AutoProcessor.from_pretrained(HF_MODEL_ID)
11
- model = AutoModelForSeq2SeqLM.from_pretrained(HF_MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16)
 
 
 
 
 
 
12
 
13
- def transcribe(audio_tuple):
14
- if audio_tuple is None:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
15
  return "No audio provided"
16
- sr, data = audio_tuple
17
- inputs = processor.apply_transcription_request(language="en", model_id=HF_MODEL_ID, audio=[data], format=["WAV"], return_tensors="pt")
18
- inputs = {k: (v.to(model.device) if hasattr(v, 'to') else v) for k, v in inputs.items()}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
  with torch.no_grad():
20
- output_ids = model.generate(**inputs, max_new_tokens=256)
21
- # Voxtral returns full sequence; decode and strip special tokens
22
- text = processor.tokenizer.decode(output_ids[0], skip_special_tokens=True)
23
- return text
 
 
24
 
25
  with gr.Blocks() as demo:
26
  gr.Markdown(f"# 🎙️ Voxtral ASR Demo — {MODEL_NAME}")
27
- audio = gr.Audio(sources="microphone", type="numpy", label="Record or upload audio")
 
 
 
 
 
 
 
 
28
  btn = gr.Button("Transcribe")
29
- out = gr.Textbox(label="Transcription", lines=4)
30
- btn.click(transcribe, inputs=[audio], outputs=[out])
31
 
32
  if __name__ == "__main__":
33
  demo.launch(mcp_server=True, ssr_mode=False)
 
1
  import os
2
  import gradio as gr
3
  import torch
4
+ from transformers import AutoProcessor
5
+ try:
6
+ from transformers import VoxtralForConditionalGeneration as VoxtralModelClass
7
+ except Exception:
8
+ # Fallback for older transformers versions
9
+ from transformers import AutoModelForSeq2SeqLM as VoxtralModelClass
10
 
11
  HF_MODEL_ID = os.getenv("HF_MODEL_ID", "mistralai/Voxtral-Mini-3B-2507")
12
  MODEL_NAME = os.getenv("MODEL_NAME", HF_MODEL_ID.split("/")[-1])
13
  HF_USERNAME = os.getenv("HF_USERNAME", "")
14
+ MODEL_SUBFOLDER = os.getenv("MODEL_SUBFOLDER", "").strip()
15
 
16
+ try:
17
+ processor = AutoProcessor.from_pretrained(HF_MODEL_ID)
18
+ except Exception:
19
+ # Fallback: some repos may store processor files inside the subfolder
20
+ if MODEL_SUBFOLDER:
21
+ processor = AutoProcessor.from_pretrained(HF_MODEL_ID, subfolder=MODEL_SUBFOLDER)
22
+ else:
23
+ raise
24
 
25
+ device = "cuda" if torch.cuda.is_available() else "cpu"
26
+ # Use float32 on CPU; bfloat16 on CUDA if available
27
+ if torch.cuda.is_available():
28
+ model_kwargs = {"device_map": "auto", "torch_dtype": torch.bfloat16}
29
+ else:
30
+ model_kwargs = {"torch_dtype": torch.float32}
31
+
32
+ if MODEL_SUBFOLDER:
33
+ model = VoxtralModelClass.from_pretrained(
34
+ HF_MODEL_ID, subfolder=MODEL_SUBFOLDER, **model_kwargs
35
+ )
36
+ else:
37
+ model = VoxtralModelClass.from_pretrained(
38
+ HF_MODEL_ID, **model_kwargs
39
+ )
40
+
41
+ # Simple language options (with Auto detection)
42
+ LANGUAGES = {
43
+ "Auto": "auto",
44
+ "English": "en",
45
+ "French": "fr",
46
+ "German": "de",
47
+ "Spanish": "es",
48
+ "Italian": "it",
49
+ "Portuguese": "pt",
50
+ "Dutch": "nl",
51
+ "Hindi": "hi",
52
+ }
53
+
54
+ MAX_NEW_TOKENS = 1024
55
+
56
+ def transcribe(sel_language, audio_path):
57
+ if audio_path is None:
58
  return "No audio provided"
59
+ language_code = LANGUAGES.get(sel_language, "auto")
60
+ # Build Voxtral transcription inputs from filepath and selected language
61
+ if hasattr(processor, "apply_transcrition_request"):
62
+ inputs = processor.apply_transcrition_request(
63
+ language=language_code,
64
+ audio=audio_path,
65
+ model_id=HF_MODEL_ID,
66
+ )
67
+ else:
68
+ # Compatibility with potential corrected naming
69
+ inputs = processor.apply_transcription_request(
70
+ language=language_code,
71
+ audio=audio_path,
72
+ model_id=HF_MODEL_ID,
73
+ )
74
+ # Move to device with appropriate dtype
75
+ inputs = inputs.to(device, dtype=(torch.bfloat16 if device == "cuda" else torch.float32))
76
  with torch.no_grad():
77
+ output_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
78
+ # Decode only newly generated tokens (beyond the prompt length)
79
+ decoded = processor.batch_decode(
80
+ output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
81
+ )
82
+ return decoded[0]
83
 
84
  with gr.Blocks() as demo:
85
  gr.Markdown(f"# 🎙️ Voxtral ASR Demo — {MODEL_NAME}")
86
+ with gr.Row():
87
+ language = gr.Dropdown(
88
+ choices=list(LANGUAGES.keys()), value="Auto", label="Language"
89
+ )
90
+ audio = gr.Audio(
91
+ sources=["upload", "microphone"],
92
+ type="filepath",
93
+ label="Upload or record audio",
94
+ )
95
  btn = gr.Button("Transcribe")
96
+ out = gr.Textbox(label="Transcription", lines=8)
97
+ btn.click(transcribe, inputs=[language, audio], outputs=[out])
98
 
99
  if __name__ == "__main__":
100
  demo.launch(mcp_server=True, ssr_mode=False)