Joseph Pollack committed on
Commit 622df64 · unverified · 1 Parent(s): 4d542a9

adds readme

Files changed (2)
  1. README.md +146 -31
  2. simple_test.py → tests/simple_test.py +0 -0
README.md CHANGED
@@ -12,26 +12,28 @@ short_description: FinetuneASR Voxtral
12
 
13
  # Finetune Voxtral for ASR with Transformers 🤗
14
 
15
- This repository fine-tunes the [Voxtral](https://huggingface.co/Deep-unlearning/Voxtral) speech model on conversational speech datasets using the Hugging Face `transformers` and `datasets` libraries.
 
16
 
17
  ## Installation
18
 
19
- ### Step 1: Clone the repository
20
 
21
  ```bash
22
  git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
23
  cd Finetune-Voxtral-ASR
24
  ```
25
 
26
- ### Step 2: Set up environment
27
 
28
- Choose your preferred package manager:
29
 
30
  <details>
31
  <summary>📦 Using UV (recommended)</summary>
32
 
33
- [Install `uv`](https://docs.astral.sh/uv/getting-started/installation/)
34
-
35
  ```bash
36
  uv venv .venv --python 3.10 && source .venv/bin/activate
37
  uv pip install -r requirements.txt
@@ -50,53 +52,166 @@ pip install -r requirements.txt
50
 
51
  </details>
52
 
53
- ## Dataset Preparation
54
-
55
- Perfect — here’s a **drop-in replacement** for your README’s “Dataset Preparation” that matches your script (uses **`hf-audio/esb-datasets-test-only-sorted`** with the **`voxpopuli`** config, 16 kHz casting, and a small train/eval slice), and explains the Voxtral/LLaMA-style prompt+label masking your collator implements.
56
 
57
- ---
58
-
59
- ## Dataset Preparation
60
 
61
- For ASR fine-tuning, inputs look like:
62
 
63
- * **Inputs**: `[AUDIO] [AUDIO] <transcribe> <reference transcription>`
64
- * **Labels**: same sequence, but the prefix `[AUDIO] … [AUDIO] <transcribe>` is **masked with `-100`** so loss is computed **only** on the transcription tokens.
65
 
66
- The `VoxtralDataCollator` already builds this sequence (prompt expansion via the processor and label masking).
67
- The dataset only needs two fields:
68
 
69
  ```python
70
  {
71
- "audio": {"array": <float32 numpy array>, "sampling_rate": 16000, ...},
72
- "text": "<reference transcription>"
73
  }
74
  ```
75
 
 
76
 
77
- If you want to swap to a different dataset, ensure after loading you still have:
78
 
79
- * an **`audio`** column (cast to `Audio(sampling_rate=16000)`), and
80
- * a **`text`** column (the reference transcription).
81
 
82
- If your dataset uses different column names, map them to `audio` and `text` before returning.
 
83
 
84
- ## Training
85
 
86
- Run the training script:
87
 
88
  ```bash
89
- uv run train.py
 
90
  ```
91
 
92
- Logs and checkpoints will be saved under the `outputs/` directory by default.
 
93
 
94
- ## Training with LoRA
95
 
96
- You can also run the training script with LoRA:
 
97
 
98
  ```bash
99
- uv run train_lora.py
 
100
  ```
101
 
102
- **Happy fine-tuning Voxtral!** 🚀
 
12
 
13
  # Finetune Voxtral for ASR with Transformers 🤗
14
 
15
+ This repository fine-tunes the Voxtral speech model for automatic speech recognition (ASR) using Hugging Face `transformers` and `datasets`. It includes:
16
+
17
+ - Full and LoRA training scripts
18
+ - A Gradio interface to collect audio, build a JSONL dataset, fine-tune, push to Hub, and deploy a demo Space
19
+ - Utilities to push trained models and datasets to the Hugging Face Hub
20
 
21
  ## Installation
22
 
23
+ ### 1) Clone the repository
24
 
25
  ```bash
26
  git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
27
  cd Finetune-Voxtral-ASR
28
  ```
29
 
30
+ ### 2) Create an environment and install dependencies
31
 
32
+ Choose your package manager.
33
 
34
  <details>
35
  <summary>📦 Using UV (recommended)</summary>
36
 
 
 
37
  ```bash
38
  uv venv .venv --python 3.10 && source .venv/bin/activate
39
  uv pip install -r requirements.txt
 
52
 
53
  </details>
54
 
55
+ ## Quick start options
 
 
56
 
57
+ - Train from CLI: run `scripts/train.py` (full) or `scripts/train_lora.py` (LoRA)
58
+ - Use the Gradio interface: run `python interface.py` to record or upload audio, build a JSONL dataset, train, push to the Hub, and deploy a demo Space
 
59
 
60
+ ## Dataset preparation
61
 
62
+ Training scripts accept either a local JSONL or a small Hub dataset slice.
 
63
 
64
+ - Local JSONL format expected by collators and push utilities:
 
65
 
66
  ```python
67
  {
68
+ "audio_path": "/abs/or/relative/path.wav",
69
+ "text": "reference transcription"
70
  }
71
  ```
72
 
73
+ - When loading from the Hub (the default fallback), the `voxpopuli` config of `hf-audio/esb-datasets-test-only-sorted` is used, and its audio column is cast to `Audio(sampling_rate=16000)`.
74
 
75
+ - The custom `VoxtralDataCollator` builds each input by generating the prompt from the audio with `VoxtralProcessor.apply_transcription_request(...)` and appending the transcription tokens. Prompt positions are masked with `-100` in the labels, so only transcription tokens contribute to the loss.
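
The prompt masking described above can be sketched in plain Python. This is a minimal illustration, not the repository's `VoxtralDataCollator`: the helper name `build_labels` and the token IDs are hypothetical, and the only convention relied on is that PyTorch's cross-entropy loss ignores label `-100` by default.

```python
# Hypothetical sketch of prompt masking; not the repository's collator code.
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, transcription_ids):
    """Concatenate prompt and transcription, masking the prompt in the labels."""
    input_ids = list(prompt_ids) + list(transcription_ids)
    # Prompt positions get IGNORE_INDEX so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(transcription_ids)
    return input_ids, labels


# Made-up token IDs, purely for illustration.
input_ids, labels = build_labels([101, 102, 103], [7, 8, 9])
print(input_ids)  # [101, 102, 103, 7, 8, 9]
print(labels)     # [-100, -100, -100, 7, 8, 9]
```

`input_ids` and `labels` always have the same length; only the masked positions differ.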
76
 
77
+ Minimum columns after loading/mapping:
 
78
 
79
+ - `audio` cast to `Audio(sampling_rate=16000)` (Hub) or created from `audio_path` (local JSONL)
80
+ - `text` transcription string
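
Before training on a local JSONL, it can help to check that every row has the two expected fields. The sketch below is an assumption-laden helper, not part of the repository: the function names `read_jsonl_rows` and `missing_audio_files` are hypothetical, and it only relies on the `{audio_path, text}` row format described above.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("audio_path", "text")


def read_jsonl_rows(jsonl_path):
    """Read a JSONL file and check each row has `audio_path` and `text`."""
    rows = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [key for key in REQUIRED_KEYS if key not in row]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            if not isinstance(row["text"], str) or not row["text"].strip():
                raise ValueError(f"line {line_no}: `text` must be a non-empty string")
            rows.append(row)
    return rows


def missing_audio_files(rows):
    """Return the `audio_path` values that do not point to an existing file."""
    return [row["audio_path"] for row in rows if not Path(row["audio_path"]).is_file()]
```

Relative `audio_path` values are checked against the current working directory here; whether the training scripts resolve them the same way is not stated in this README.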
81
 
82
+ ## Full fine-tuning (scripts/train.py)
83
 
84
+ Run with either a local JSONL or the default tiny Hub slice:
85
 
86
  ```bash
87
+ python scripts/train.py \
88
+ --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
89
+ --dataset-jsonl datasets/voxtral_user/data.jsonl \
90
+ --train-count 100 --eval-count 50 \
91
+ --batch-size 2 --grad-accum 4 --learning-rate 5e-5 --epochs 3 \
92
+ --output-dir ./voxtral-finetuned
93
  ```
94
 
95
+ Key args:
96
+
97
+ - `--dataset-jsonl`: local JSONL with `{audio_path, text}`. If omitted, uses `hf-audio/esb-datasets-test-only-sorted`/`voxpopuli` test slice
98
+ - `--dataset-name`, `--dataset-config`: override default Hub dataset
99
+ - `--train-count`, `--eval-count`: small sample sizes for quick runs
100
+ - `--trackio-space`: HF Space ID for Trackio logging; if omitted and `HF_TOKEN` is set, a space name is auto-derived
101
+ - `--push-dataset`, `--dataset-repo`: optionally push your local JSONL dataset to the Hub after training
102
 
103
+ Environment for logging and Hub auth:
104
 
105
+ - `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`: enables Trackio space naming and Hub uploads
106
+
107
+ Outputs: model and processor saved to `--output-dir`.
108
+
109
+ ## LoRA fine-tuning (scripts/train_lora.py)
110
 
111
  ```bash
112
+ python scripts/train_lora.py \
113
+ --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
114
+ --dataset-jsonl datasets/voxtral_user/data.jsonl \
115
+ --train-count 100 --eval-count 50 \
116
+ --batch-size 2 --grad-accum 4 --learning-rate 5e-5 --epochs 3 \
117
+ --lora-r 8 --lora-alpha 32 --lora-dropout 0.0 --freeze-audio-tower \
118
+ --output-dir ./voxtral-finetuned-lora
119
  ```
120
 
121
+ Additional LoRA args:
122
+
123
+ - `--lora-r`, `--lora-alpha`, `--lora-dropout`
124
+ - `--freeze-audio-tower`: optionally freeze audio encoder params
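
As a quick sanity check on the example flags, the arithmetic below works out the effective batch size and the conventional LoRA scaling factor (`lora_alpha / r`). It is an illustrative sketch: the helper names are hypothetical, and it assumes `--batch-size` is a per-device value and that training runs on a single device.

```python
def effective_batch_size(batch_size, grad_accum, num_devices=1):
    """Samples consumed per optimizer step."""
    return batch_size * grad_accum * num_devices


def lora_scaling(lora_alpha, lora_r):
    """Conventional LoRA scaling applied to the low-rank update: alpha / r."""
    return lora_alpha / lora_r


# Values taken from the example command: --batch-size 2 --grad-accum 4
print(effective_batch_size(2, 4))  # 8
# --lora-r 8 --lora-alpha 32
print(lora_scaling(32, 8))  # 4.0
```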
125
+
126
+ ## End-to-end via Gradio interface (interface.py)
127
+
128
+ Start the UI:
129
+
130
+ ```bash
131
+ python interface.py
132
+ ```
133
+
134
+ What it does:
135
+
136
+ - Record microphone audio or upload files + transcripts
137
+ - Saves datasets to `datasets/voxtral_user/` as `data.jsonl` or `recorded_data.jsonl`
138
+ - Kicks off full or LoRA training with streamed logs
139
+ - Optionally pushes dataset and model to the Hub
140
+ - Optionally deploys a Voxtral ASR demo Space
141
+
142
+ Environment variables used by the interface:
143
+
144
+ - `HF_WRITE_TOKEN` or `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`: write/read token for Hub actions
145
+ - `HF_READ_TOKEN`: optional read token
146
+ - `HF_USERNAME`: fallback username if it cannot be derived from the token
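
One way to read the token variables above is as a fallback chain. The sketch below assumes the precedence follows the order in which the variables are listed; `interface.py` may resolve them differently, and the helper name `first_set_env` is hypothetical.

```python
import os

# Assumed precedence: the order in which the variables are listed above.
WRITE_TOKEN_VARS = ("HF_WRITE_TOKEN", "HF_TOKEN", "HUGGINGFACE_HUB_TOKEN")


def first_set_env(names, env=None):
    """Return the first non-empty value among `names`, or None if none is set."""
    env = os.environ if env is None else env
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


token = first_set_env(WRITE_TOKEN_VARS)
if token is None:
    print("No Hub token found; Hub actions will be unavailable.")
```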
147
+
148
+ Notes:
149
+
150
+ - The interface draws its prompt phrases from a multilingual source (CohereLabs/AYA when a token is available; otherwise localized fallbacks)
151
+ - Output models are placed under `outputs/<username_repo>/`
152
+
153
+ ## Push models and datasets to Hugging Face (scripts/push_to_huggingface.py)
154
+
155
+ Push a trained model directory (full or LoRA):
156
+
157
+ ```bash
158
+ python scripts/push_to_huggingface.py model ./voxtral-finetuned my-voxtral-asr \
159
+ --author-name "Your Name" \
160
+ --model-description "Fine-tuned Voxtral ASR" \
161
+ --model-name mistralai/Voxtral-Mini-3B-2507
162
+ ```
163
+
164
+ Push a dataset JSONL and its audio files:
165
+
166
+ ```bash
167
+ python scripts/push_to_huggingface.py dataset datasets/voxtral_user/data.jsonl my-voxtral-dataset
168
+ ```
169
+
170
+ Tips:
171
+
172
+ - If you pass bare repo names (no `username/`), the tool will resolve your username from the token or `HF_USERNAME`.
173
+ - For LoRA outputs, the pusher detects adapter files; for full models it detects `config.json` + weight files and uploads accordingly.
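
The bare-name behavior in the first tip can be pictured as follows. This is a hypothetical sketch, not the code in `scripts/push_to_huggingface.py`: the function name `qualify_repo_id` is invented for illustration, and the username is passed in directly rather than looked up from a token.

```python
def qualify_repo_id(repo_name, username=None):
    """Return `username/repo` for a bare name; leave qualified IDs unchanged."""
    if "/" in repo_name:
        return repo_name  # already fully qualified
    if not username:
        raise ValueError(
            "Cannot qualify a bare repo name without a username; "
            "pass `username/repo` or set HF_USERNAME."
        )
    return f"{username}/{repo_name}"


print(qualify_repo_id("my-voxtral-asr", "your-hf-username"))
# your-hf-username/my-voxtral-asr
```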
174
+
175
+ ## Deploy a demo Space (scripts/deploy_demo_space.py)
176
+
177
+ Deploy a Voxtral demo Space for a pushed model:
178
+
179
+ ```bash
180
+ python scripts/deploy_demo_space.py \
181
+ --hf-token $HF_TOKEN \
182
+ --hf-username your-hf-username \
183
+ --model-id your-hf-username/your-model-repo \
184
+ --demo-type voxtral \
185
+ --space-name my-voxtral-demo
186
+ ```
187
+
188
+ What it does:
189
+
190
+ - Creates the Space (pass `--skip-creation` to only upload files to an existing Space)
191
+ - Uploads template files from `templates/spaces/demo_voxtral/`
192
+ - Sets space variables and secrets (e.g., `HF_TOKEN`, `HF_MODEL_ID`) via API
193
+ - Waits for the Space to build and tests accessibility
194
+
195
+ The Space app loads either a full model or a base+LoRA adapter with `peft`, and uses `AutoProcessor` to build Voxtral transcription requests.
196
+
197
+ ## GPU and versions
198
+
199
+ - Torch 2.8.0, torchaudio 2.8.0, and `torchcodec==0.7` are specified; a CUDA-capable GPU is recommended for training
200
+ - The code prefers `bfloat16` on CUDA, `float32` on CPU
201
+
202
+ ## Troubleshooting
203
+
204
+ - No token found:
205
+ - Set `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`) in your environment for Hub operations and Trackio naming
206
+ - Invalid token or username resolution failed:
207
+ - Provide fully-qualified repo IDs like `username/repo` or set `HF_USERNAME`
208
+ - Demo Space rate limits / propagation delays:
209
+ - The deploy script retries uploads and may need extra time for the Space to build
210
+ - Collator errors:
211
+ - Ensure your JSONL rows include valid `audio_path` files and `text` strings
212
+ - Windows shell hints:
213
+ - In CMD, run `set HF_TOKEN=your_token`; in PowerShell, run `$env:HF_TOKEN = "your_token"` before running scripts
214
+
215
+ ## License
216
+
217
+ MIT
simple_test.py → tests/simple_test.py RENAMED
File without changes