# Finetune Voxtral for ASR with Transformers 🤗

This repository fine-tunes the [Voxtral](https://huggingface.co/Deep-unlearning/Voxtral) speech model on conversational speech datasets using the Hugging Face `transformers` and `datasets` libraries.

## Installation

### Step 1: Clone the repository

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

### Step 2: Set up environment

Choose your preferred package manager:

<details>
<summary>📦 Using UV (recommended)</summary>

[Install `uv`](https://docs.astral.sh/uv/getting-started/installation/)

```bash
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt
```

</details>

<details>
<summary>🐍 Using pip</summary>

```bash
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

</details>

## Dataset Preparation
The training script loads `hf-audio/esb-datasets-test-only-sorted` with the `voxpopuli` config, casts the audio column to 16 kHz, and keeps a small train/eval slice. The data collator implements Voxtral/LLaMA-style prompt + label masking, described below.
For ASR fine-tuning, each example is laid out as:

* **Inputs**: `[AUDIO] … [AUDIO] <transcribe> <reference transcription>`
* **Labels**: the same sequence, but the prefix `[AUDIO] … [AUDIO] <transcribe>` is **masked with `-100`**, so the loss is computed **only** on the transcription tokens.

The `VoxtralDataCollator` already builds this sequence (prompt expansion via the processor and label masking).
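As a rough illustration of the masking idea (not the collator's exact code), assuming the prompt occupies the first `prompt_len` token positions:

```python
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Return labels where the audio + <transcribe> prefix is ignored.

    Positions set to -100 are skipped by PyTorch's cross-entropy loss,
    so gradients flow only through the transcription tokens.
    """
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # mask [AUDIO] ... [AUDIO] <transcribe>
    return labels
```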
The dataset only needs two fields:

```python
{
    "audio": {"array": <float32 numpy array>, "sampling_rate": 16000, ...},
    "text": "<reference transcription>"
}
```
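For reference, a hedged loading sketch that yields rows in this shape; the split name and slice sizes here are assumptions, not necessarily what `train.py` uses:

```python
from datasets import Audio, load_dataset

# Split name and slice sizes are illustrative assumptions.
ds = load_dataset("hf-audio/esb-datasets-test-only-sorted", "voxpopuli", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # resample to 16 kHz
train_ds = ds.select(range(100))
eval_ds = ds.select(range(100, 120))
```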
If you want to swap in a different dataset, make sure that after loading you still have:

* an **`audio`** column (cast to `Audio(sampling_rate=16000)`), and
* a **`text`** column (the reference transcription).

If your dataset uses different column names, map them to `audio` and `text` before returning, as in the sketch below.
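For example, a dataset whose transcriptions live in a column named `sentence` could be adapted like this (the dataset ID and column name are placeholders):

```python
from datasets import Audio, load_dataset

# "your/dataset" and "sentence" are placeholders for your own data.
ds = load_dataset("your/dataset", split="train")
ds = ds.rename_column("sentence", "text")                  # -> "text"
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # -> 16 kHz "audio"
keep = {"audio", "text"}
ds = ds.remove_columns([c for c in ds.column_names if c not in keep])
```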
## Training

Run the training script:

```bash
uv run train.py
```

Logs and checkpoints will be saved under the `outputs/` directory by default.

## Training with LoRA

You can also run the training script with LoRA:

```bash
uv run train_lora.py
```
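Under the hood, LoRA fine-tuning typically wraps the model with `peft`. A minimal sketch of that setup; the rank, alpha, and target modules below are assumptions, not necessarily what `train_lora.py` configures:

```python
from peft import LoraConfig, get_peft_model

# Illustrative hyperparameters; check train_lora.py for the actual values.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # `model` is the loaded Voxtral model
model.print_trainable_parameters()       # only adapter weights should be trainable
```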
**Happy fine-tuning Voxtral!** 🚀