# Finetune Voxtral for ASR with Transformers 🤗

This repository fine-tunes the [Voxtral](https://huggingface.co/Deep-unlearning/Voxtral) speech model on conversational speech datasets using the Hugging Face `transformers` and `datasets` libraries.

## Installation

### Step 1: Clone the repository

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

### Step 2: Set up environment

Choose your preferred package manager:

<details>
<summary>📦 Using UV (recommended)</summary>

[Install `uv`](https://docs.astral.sh/uv/getting-started/installation/)

```bash
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt
```

</details>

<details>
<summary>🐍 Using pip</summary>

```bash
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

</details>

## Dataset Preparation

The training script loads the `hf-audio/esb-datasets-test-only-sorted` dataset with the `voxpopuli` config, casts the audio to 16 kHz, and keeps a small train/eval slice.
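
For reference, here is a minimal sketch of that loading step (the split name and slice sizes below are assumptions for illustration, not the exact values in `train.py`):

```python
from datasets import Audio, load_dataset

# Load the VoxPopuli config of the ESB test-only benchmark.
# Split name and slice sizes are illustrative assumptions.
ds = load_dataset("hf-audio/esb-datasets-test-only-sorted", "voxpopuli", split="test")

# Decode audio at 16 kHz, the sampling rate the Voxtral processor expects
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# Keep a small train/eval slice for quick experiments
train_ds = ds.select(range(100))
eval_ds = ds.select(range(100, 120))
```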

For ASR fine-tuning, each example looks like:

* **Inputs**: `[AUDIO] … [AUDIO] <transcribe> <reference transcription>`
* **Labels**: the same sequence, but with the prefix `[AUDIO] … [AUDIO] <transcribe>` **masked with `-100`**, so the loss is computed **only** on the transcription tokens.

The `VoxtralDataCollator` already builds this sequence (prompt expansion via the processor and label masking). The dataset only needs two fields:

```python
{
    "audio": {"array": <float32 numpy array>, "sampling_rate": 16000, ...},
    "text": "<reference transcription>"
}
```
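
To make the masking concrete, here is a simplified sketch of the label construction (illustrative only; the real `VoxtralDataCollator` also handles the audio prompt expansion via the processor, and the helper below is not part of this repo):

```python
import torch

def build_labels(prompt_ids: list[int], target_ids: list[int], pad_id: int, max_len: int):
    """Illustrative label masking: loss is computed only on transcription tokens.

    prompt_ids: token IDs for "[AUDIO] ... [AUDIO] <transcribe>"
    target_ids: token IDs for the reference transcription
    """
    input_ids = prompt_ids + target_ids
    # Mask the prompt with -100, the ignore index of cross-entropy loss
    labels = [-100] * len(prompt_ids) + list(target_ids)

    # Right-pad to a fixed length; padding is also ignored in the loss
    pad_amount = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * pad_amount
    labels = labels + [-100] * pad_amount
    attention_mask = [1] * (max_len - pad_amount) + [0] * pad_amount

    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```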

If you want to swap in a different dataset, make sure that after loading you still have:

* an **`audio`** column (cast to `Audio(sampling_rate=16000)`), and
* a **`text`** column (the reference transcription).

If your dataset uses different column names, map them to `audio` and `text` before returning, as sketched below.
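
For example, adapting a hypothetical dataset whose columns are named `path` and `sentence` might look like this (the dataset ID and column names are placeholders):

```python
from datasets import Audio, load_dataset

# Hypothetical dataset with columns "path" (audio) and "sentence" (text)
ds = load_dataset("user/some-asr-dataset", split="train")

# Rename the columns to what the collator expects
ds = ds.rename_column("path", "audio")
ds = ds.rename_column("sentence", "text")

# Make sure audio decodes at 16 kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```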

## Training

Run the training script:

```bash
uv run train.py
```

Logs and checkpoints will be saved under the `outputs/` directory by default.

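If you want to change where outputs land (or other hyperparameters), the relevant knobs live in the script's `TrainingArguments`; a rough sketch of what such a configuration looks like (values here are illustrative assumptions, not the exact settings in `train.py`):

```python
from transformers import TrainingArguments

# Illustrative arguments; train.py may use different values.
training_args = TrainingArguments(
    output_dir="outputs",            # where logs and checkpoints are written
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    num_train_epochs=1,
    logging_steps=10,
)
```
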
## Training with LoRA

To fine-tune with LoRA adapters instead of updating the full model, run:

```bash
uv run train_lora.py
```
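
If you want to adjust the adapter settings, a typical `peft` configuration looks roughly like this (a minimal sketch; the rank, scaling, and target modules are assumptions, not the exact values in `train_lora.py`):

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; train_lora.py may use different values.
lora_config = LoraConfig(
    r=16,                    # adapter rank
    lora_alpha=32,           # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# model = get_peft_model(model, lora_config)  # wraps the base Voxtral model
```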

**Happy fine-tuning Voxtral!** 🚀