---
title: VoxFactory
emoji: 🌬️
colorFrom: gray
colorTo: red
sdk: gradio
app_file: interface.py
pinned: false
license: mit
short_description: FinetuneASR Voxtral
---

# Finetune Voxtral for ASR with Transformers 🤗

This repository fine-tunes the Voxtral speech model for automatic speech recognition (ASR) using Hugging Face `transformers` and `datasets`. It includes:

- Full and LoRA training scripts
- A Gradio interface to collect audio, build a JSONL dataset, fine-tune, push to Hub, and deploy a demo Space
- Utilities to push trained models and datasets to the Hugging Face Hub

## Installation

### 1) Clone the repository

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

### 2) Create environment and install deps

Choose your package manager.

<details>
<summary>📦 Using UV (recommended)</summary>

```bash
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt
```

</details>

<details>
<summary>🐍 Using pip</summary>

```bash
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

</details>

## Quick start options

- Train from CLI: run `scripts/train.py` (full) or `scripts/train_lora.py` (LoRA)
- Use the Gradio interface: `python interface.py` to record/upload audio, create dataset JSONL, train, push, and deploy a demo Space

## Dataset preparation

Training scripts accept either a local JSONL or a small Hub dataset slice.

- Local JSONL format expected by the collators and push utilities (one JSON object per line):

```json
{"audio_path": "/abs/or/relative/path.wav", "text": "reference transcription"}
```

- When loading from the Hub (the default fallback), `hf-audio/esb-datasets-test-only-sorted` with config `voxpopuli` is used, and the audio column is cast to `Audio(sampling_rate=16000)`.

- The custom `VoxtralDataCollator` builds each input as a transcription prompt (created from the audio via `VoxtralProcessor.apply_transcription_request(...)`) followed by the label tokens. The loss is masked over the prompt, so only the transcription tokens contribute to training.

Minimum columns after loading/mapping:

- `audio`: cast to `Audio(sampling_rate=16000)` (Hub) or created from `audio_path` (local JSONL)
- `text`: the transcription string
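
The local JSONL format above can be produced and sanity-checked with a few lines of standard-library Python. This is a hedged sketch, not repo code: the file name and audio paths are illustrative, and only the two required fields are checked.

```python
import json

def write_jsonl(rows, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def validate_jsonl(path):
    """Assert every row carries the two required string fields."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            row = json.loads(line)
            assert isinstance(row.get("audio_path"), str), f"line {i}: bad audio_path"
            assert isinstance(row.get("text"), str), f"line {i}: bad text"

rows = [{"audio_path": "clips/0001.wav", "text": "reference transcription"}]
write_jsonl(rows, "data.jsonl")
validate_jsonl("data.jsonl")
```

Running the validator before training catches malformed rows early, before the collator tries to load missing audio files.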

## Full fine-tuning (scripts/train.py)

Run with either a local JSONL or the default tiny Hub slice:

```bash
python scripts/train.py \
  --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
  --dataset-jsonl datasets/voxtral_user/data.jsonl \
  --train-count 100 --eval-count 50 \
  --batch-size 2 --grad-accum 4 --learning-rate 5e-5 --epochs 3 \
  --output-dir ./voxtral-finetuned
```

Key args:

- `--dataset-jsonl`: local JSONL with `{audio_path, text}` rows. If omitted, a test slice of `hf-audio/esb-datasets-test-only-sorted` (config `voxpopuli`) is used
- `--dataset-name`, `--dataset-config`: override default Hub dataset
- `--train-count`, `--eval-count`: small sample sizes for quick runs
- `--trackio-space`: HF Space ID for Trackio logging; if omitted and `HF_TOKEN` is set, a space name is auto-derived
- `--push-dataset`, `--dataset-repo`: optionally push your local JSONL dataset to the Hub after training
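
One note on the batching arguments above: gradient accumulation multiplies the per-device batch size, so the example command performs one optimizer step per 8 examples. A quick arithmetic sketch (the step count assumes no dataloader drop-last behavior):

```python
import math

# Effective (optimizer-step) batch size is per-device batch * accumulation steps.
batch_size, grad_accum, train_count = 2, 4, 100
effective_batch = batch_size * grad_accum                    # 8 with the example flags
steps_per_epoch = math.ceil(train_count / effective_batch)   # update steps per epoch
```

Raising `--grad-accum` is the usual way to simulate a larger batch when GPU memory limits `--batch-size`.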

Environment for logging and Hub auth:

- `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`: enables Trackio space naming and Hub uploads

Outputs: model and processor saved to `--output-dir`.

## LoRA fine-tuning (scripts/train_lora.py)

```bash
python scripts/train_lora.py \
  --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
  --dataset-jsonl datasets/voxtral_user/data.jsonl \
  --train-count 100 --eval-count 50 \
  --batch-size 2 --grad-accum 4 --learning-rate 5e-5 --epochs 3 \
  --lora-r 8 --lora-alpha 32 --lora-dropout 0.0 --freeze-audio-tower \
  --output-dir ./voxtral-finetuned-lora
```

Additional LoRA args:

- `--lora-r`, `--lora-alpha`, `--lora-dropout`
- `--freeze-audio-tower`: optionally freeze audio encoder params
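
For intuition on how `--lora-r` and `--lora-alpha` interact, here is the standard LoRA formulation (not code from the script): the adapter adds a low-rank update `(alpha / r) * B @ A` to each targeted weight, so the defaults above give a scaling factor of 32 / 8 = 4. A toy sketch with hand-rolled matrices; real training uses `peft`:

```python
def lora_delta(A, B, r, alpha):
    """Scaled low-rank update (alpha / r) * B @ A for small nested-list matrices."""
    scale = alpha / r
    return [[scale * sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

# Rank-1 toy example: A is 1x2, B is 2x1, alpha=4 -> scale 4.0
delta = lora_delta(A=[[1.0, 2.0]], B=[[1.0], [0.5]], r=1, alpha=4)
```

Because only `A` and `B` (rank `r` factors) are trained, LoRA touches a tiny fraction of the 3B parameters, which is why it fits on much smaller GPUs than full fine-tuning.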

## End-to-end via Gradio interface (interface.py)

Start the UI:

```bash
python interface.py
```

What it does:

- Accepts microphone recordings or uploaded audio files with transcripts
- Saves datasets to `datasets/voxtral_user/` as `data.jsonl` or `recorded_data.jsonl`
- Kicks off full or LoRA training with streamed logs
- Optionally pushes dataset and model to the Hub
- Optionally deploys a Voxtral ASR demo Space

Environment variables used by the interface:

- `HF_WRITE_TOKEN` or `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`: write/read token for Hub actions
- `HF_READ_TOKEN`: optional read token
- `HF_USERNAME`: fallback username if it cannot be derived from the token

Notes:

- The interface uses a multilingual phrase source (CohereLabs/AYA via token; otherwise localized fallbacks)
- Output models are placed under `outputs/<username_repo>/`
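
The `outputs/<username_repo>/` layout suggests the model repo id is flattened into a single directory name. A hedged guess at that mapping; the exact sanitization lives in `interface.py` and may differ:

```python
# Assumption: "username/repo" is flattened by replacing "/" with "_".
def output_dir_for(repo_id: str) -> str:
    return "outputs/" + repo_id.replace("/", "_")

print(output_dir_for("alice/my-voxtral-asr"))
```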

## Push models and datasets to Hugging Face (scripts/push_to_huggingface.py)

Push a trained model directory (full or LoRA):

```bash
python scripts/push_to_huggingface.py model ./voxtral-finetuned my-voxtral-asr \
  --author-name "Your Name" \
  --model-description "Fine-tuned Voxtral ASR" \
  --model-name mistralai/Voxtral-Mini-3B-2507
```

Push a dataset JSONL and its audio files:

```bash
python scripts/push_to_huggingface.py dataset datasets/voxtral_user/data.jsonl my-voxtral-dataset
```

Tips:

- If you pass bare repo names (no `username/`), the tool will resolve your username from the token or `HF_USERNAME`.
- For LoRA outputs, the pusher detects adapter files; for full models it detects `config.json` + weight files and uploads accordingly.
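
The adapter-vs-full detection described above can be sketched roughly like this. The file names follow common `peft`/`transformers` save conventions; the script's actual logic may check more files:

```python
import os

def detect_model_kind(model_dir: str) -> str:
    """Guess whether a directory holds a LoRA adapter or a full model."""
    files = set(os.listdir(model_dir))
    if "adapter_config.json" in files:  # peft writes adapter_config.json
        return "lora"
    has_weights = any(f.endswith((".safetensors", ".bin")) for f in files)
    if "config.json" in files and has_weights:  # full model: config + weight shards
        return "full"
    return "unknown"
```

If detection comes back `unknown`, the usual cause is pointing the pusher at a parent directory instead of the actual checkpoint folder.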

## Deploy a demo Space (scripts/deploy_demo_space.py)

Deploy a Voxtral demo Space for a pushed model:

```bash
python scripts/deploy_demo_space.py \
  --hf-token $HF_TOKEN \
  --hf-username your-hf-username \
  --model-id your-hf-username/your-model-repo \
  --demo-type voxtral \
  --space-name my-voxtral-demo
```

What it does:

- Creates the Space (or use `--skip-creation` to only upload)
- Uploads template files from `templates/spaces/demo_voxtral/`
- Sets space variables and secrets (e.g., `HF_TOKEN`, `HF_MODEL_ID`) via API
- Waits for the Space to build and tests accessibility

The Space app loads either a full model or a base+LoRA adapter with `peft`, and uses `AutoProcessor` to build Voxtral transcription requests.

## GPU and versions

- Torch 2.8.0, torchaudio 2.8.0, and `torchcodec==0.7` are specified; a CUDA-capable GPU is recommended for training
- The code prefers `bfloat16` on CUDA, `float32` on CPU

## Troubleshooting

- No token found:
  - Set `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`) in your environment for Hub operations and Trackio naming
- Invalid token or username resolution failed:
  - Provide fully-qualified repo IDs like `username/repo` or set `HF_USERNAME`
- Demo Space rate limits / propagation delays:
  - The deploy script retries uploads and may need extra time for the Space to build
- Collator errors:
  - Ensure your JSONL rows include valid `audio_path` files and `text` strings
- Windows shell hints:
  - Use `set HF_TOKEN=your_token` in CMD, or `$env:HF_TOKEN = "your_token"` in PowerShell, before running scripts

## License

MIT