# Finetune Voxtral for ASR with Transformers 🤗

This repository fine-tunes the [Voxtral](https://huggingface.co/Deep-unlearning/Voxtral) speech model on conversational speech datasets using the Hugging Face `transformers` and `datasets` libraries.

## Installation

### Step 1: Clone the repository

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

### Step 2: Set up environment

Choose your preferred package manager:

<details>
<summary>📦 Using UV (recommended)</summary>

[Install `uv`](https://docs.astral.sh/uv/getting-started/installation/)

```bash
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt
```

</details>

<details>
<summary>🐍 Using pip</summary>

```bash
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

</details>

## Dataset Preparation
The training script loads `hf-audio/esb-datasets-test-only-sorted` with the `voxpopuli` config, casts the audio column to 16 kHz, and keeps a small train/eval slice. The data collator implements Voxtral/LLaMA-style prompt + label masking, described below.
For ASR fine-tuning, each example is laid out as:

* **Inputs**: `[AUDIO] … [AUDIO] <transcribe> <reference transcription>`
* **Labels**: the same sequence, but the prefix `[AUDIO] … [AUDIO] <transcribe>` is **masked with `-100`**, so the loss is computed **only** on the transcription tokens.

The `VoxtralDataCollator` already builds this sequence (prompt expansion via the processor and label masking).
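As a rough illustration of the masking idea (not the collator's exact code), assuming the prompt occupies the first `prompt_len` token positions:

```python
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Return labels where the audio + <transcribe> prefix is ignored.

    Positions set to -100 are skipped by PyTorch's cross-entropy loss,
    so gradients flow only through the transcription tokens.
    """
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # mask [AUDIO] ... [AUDIO] <transcribe>
    return labels
```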
The dataset only needs two fields:

```python
{
    "audio": {"array": <float32 numpy array>, "sampling_rate": 16000, ...},
    "text": "<reference transcription>"
}
```
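For reference, a hedged loading sketch that yields rows in this shape; the split name and slice sizes here are assumptions, not necessarily what `train.py` uses:

```python
from datasets import Audio, load_dataset

# Split name and slice sizes are illustrative assumptions.
ds = load_dataset("hf-audio/esb-datasets-test-only-sorted", "voxpopuli", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # resample to 16 kHz
train_ds = ds.select(range(100))
eval_ds = ds.select(range(100, 120))
```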
If you want to swap in a different dataset, make sure that after loading you still have:

* an **`audio`** column (cast to `Audio(sampling_rate=16000)`), and
* a **`text`** column (the reference transcription).

If your dataset uses different column names, map them to `audio` and `text` before returning, as in the sketch below.
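For example, a dataset whose transcriptions live in a column named `sentence` could be adapted like this (the dataset ID and column name are placeholders):

```python
from datasets import Audio, load_dataset

# "your/dataset" and "sentence" are placeholders for your own data.
ds = load_dataset("your/dataset", split="train")
ds = ds.rename_column("sentence", "text")                  # -> "text"
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # -> 16 kHz "audio"
keep = {"audio", "text"}
ds = ds.remove_columns([c for c in ds.column_names if c not in keep])
```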
## Training

Run the training script:

```bash
uv run train.py
```

Logs and checkpoints will be saved under the `outputs/` directory by default.

## Training with LoRA

You can also run the training script with LoRA:

```bash
uv run train_lora.py
```
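Under the hood, LoRA fine-tuning typically wraps the model with `peft`. A minimal sketch of that setup; the rank, alpha, and target modules below are assumptions, not necessarily what `train_lora.py` configures:

```python
from peft import LoraConfig, get_peft_model

# Illustrative hyperparameters; check train_lora.py for the actual values.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # `model` is the loaded Voxtral model
model.print_trainable_parameters()       # only adapter weights should be trainable
```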
**Happy fine-tuning Voxtral!** 🚀