---
license: mit
language:
- en
base_model:
- distil-whisper/distil-large-v3.5
pipeline_tag: automatic-speech-recognition
library_name: transformers
model-index:
- name: whisper-large-v3-singlish-DRAFT
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SASRBench-v1
      type: mjwong/SASRBench-v1
      split: test
    metrics:
      - name: WER
        type: wer
        value: 14.84
  - task:
      type: automatic-speech-recognition
    dataset:
      name: AMI
      type: edinburghcstr/ami
      config: ihm
      split: test
    metrics:
      - name: WER
        type: wer
        value: 22.06
  - task:
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      config: test
      split: test
    metrics:
      - name: WER
        type: wer
        value: 12.81
tags:
- whisper
---

# Whisper large-v3-singlish-DRAFT

**Whisper large-v3-singlish-DRAFT** is a fine-tuned automatic speech recognition (ASR) model tailored specifically for Singlish. Built on [distil-whisper/distil-large-v3.5](https://huggingface.co/distil-whisper/distil-large-v3.5), a distilled variant of OpenAI's Whisper, it has been fine-tuned on Singlish-centric data to better capture the distinctive phonetic patterns and vocabulary commonly found in Singlish speech. It is designed to serve as a lightweight draft model in speculative decoding pipelines, working in tandem with [Whisper large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish) as the target model to improve transcription speed while maintaining accuracy.

>**Note**: All results presented here for Whisper large-v3-singlish-DRAFT were obtained using the speculative decoding variant of Whisper large-v3-singlish (DRAFT mode).

## Model Details

- **Developed by:** Ming Jie Wong
- **Base Model:** [distil-whisper/distil-large-v3.5](https://huggingface.co/distil-whisper/distil-large-v3.5)
- **Model Type:** Encoder-decoder
- **Metrics:** Word Error Rate (WER)
- **Languages Supported:** English (with a focus on Singlish)
- **License:** MIT

### Description

Whisper-large-v3-singlish-DRAFT was trained using pseudo-labels generated by its target model, Whisper-large-v3-singlish. The target model transcribed 66.9k audio recordings sourced from Part 3 of the Same Room Environment Close-talk Microphone section of [IMDA’s National Speech Corpus (NSC)](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus). This self-distillation approach ensures close alignment between the draft and target models, enabling effective speculative decoding on Singlish speech.
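
In essence, the pseudo-labelling step amounts to running the target model over the raw audio and using its transcripts as training labels. The sketch below is illustrative only, assuming a generic `transformers`/`datasets` workflow; the actual preprocessing scripts are not published, and the field names are assumptions.

```python
from transformers import pipeline

# Hypothetical pseudo-labelling pass: the target model's transcripts
# become the labels used to train the draft model.
target = pipeline(
    "automatic-speech-recognition",
    model="mjwong/whisper-large-v3-singlish",
)

def pseudo_label(example):
    # assumed fields: "audio" (16 kHz waveform) and "text" (label column)
    example["text"] = target(example["audio"])["text"]
    return example

# nsc_segments is assumed to be a datasets.Dataset of extracted segments
# nsc_segments = nsc_segments.map(pseudo_label)
```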

The original Part 3 of the National Speech Corpus comprises approximately 1,000 hours of conversational speech from around 1,000 local English speakers, recorded in pairs. These conversations cover everyday topics and include interactive game-based dialogues. Recordings were conducted in two environments:
- Same Room, where speakers shared a room and were recorded using a close-talk mic and a boundary mic.
- Separate Room, where each speaker was recorded individually using a standing mic and a telephone (IVR).

Audio segments for the internal dataset were extracted using the following criteria; a short filtering sketch follows the list:
- **Minimum Word Count:** 10 words
  
  _This threshold was chosen to ensure that each audio segment contains sufficient linguistic context for the model to learn Singlish speech patterns. Shorter segments may bias the model towards specific utterances or phrases, limiting its overall comprehension._
- **Maximum Duration:** 20 seconds

  _This threshold was chosen to provide enough context for accurate transcription while minimizing noise and computational complexity for longer audio segments._
- **Sampling Rate**: All audio segments are down-sampled to 16kHz.
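
The sketch below expresses these criteria as a filter and resampling step. It is illustrative only, assuming simple per-segment fields; it is not the actual preprocessing code.

```python
import librosa

MIN_WORDS = 10         # minimum word count per segment
MAX_DURATION_S = 20.0  # maximum segment duration in seconds
TARGET_SR = 16_000     # target sampling rate in Hz

def keep(segment):
    # assumed fields: "text" (transcript) and "duration" (seconds)
    return (
        len(segment["text"].split()) >= MIN_WORDS
        and segment["duration"] <= MAX_DURATION_S
    )

def resample(segment):
    # down-sample the waveform to 16 kHz
    segment["audio"] = librosa.resample(
        segment["audio"], orig_sr=segment["sampling_rate"], target_sr=TARGET_SR
    )
    segment["sampling_rate"] = TARGET_SR
    return segment
```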

Full experiment details will be added soon.

### Fine-Tuning Details

We fine-tuned the model on a single A100 80GB GPU.

#### Training Hyperparameters
The following hyperparameters were used; a training-arguments sketch follows the list:
- **batch_size**: 16
- **gradient_accumulation_steps**: 1
- **learning_rate**: 1e-6
- **warmup_steps**: 300
- **max_steps**: 5000
- **fp16**: true
- **eval_batch_size**: 16
- **eval_steps**: 300
- **max_grad_norm**: 1.0
- **generation_max_length**: 225
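
These settings map directly onto the standard Hugging Face `Seq2SeqTrainingArguments`. A minimal sketch assuming that API is shown below; the `output_dir` is hypothetical, and the actual training script is not published.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-singlish-DRAFT",  # hypothetical output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    warmup_steps=300,
    max_steps=5000,
    fp16=True,
    per_device_eval_batch_size=16,
    eval_strategy="steps",        # evaluate every eval_steps steps
    eval_steps=300,
    max_grad_norm=1.0,
    generation_max_length=225,
    predict_with_generate=True,   # assumed, to compute WER during evaluation
)
```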

## Benchmark Performance


We evaluated the speculative decoding setup for Whisper-large-v3-singlish on the following datasets:

- [SASRBench-v1](https://huggingface.co/datasets/mjwong/SASRBench-v1): A benchmark dataset for evaluating ASR performance on Singlish.

- [AMI](https://huggingface.co/datasets/edinburghcstr/ami): A widely used dataset for meeting transcription and diarization tasks. This work specifically uses the IHM (Individual Headset Microphone) recordings.

- [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech): A large-scale open-source dataset with diverse English audio, covering read, conversational, and spontaneous speech.

### Model Performance

Rel. RTFx denotes throughput relative to the standard Large model (baseline 1.00); higher is faster. The best WER and Rel. RTFx for each dataset are shown in bold.

| **Dataset**     | **Model Variant**         | **Link**                                                                                           | **Rel. RTFx** | **WER**    |
|-----------------|---------------------------|----------------------------------------------------------------------------------------------------|---------------|------------|
| SASRBench-v1    | Large                     | [Whisper-large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish)               | 1.00          | 16.41%     |
| SASRBench-v1    | Large-Turbo               | [Whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish)   | **2.36**      | **13.35%** |
| SASRBench-v1    | Draft-enhanced Large      | Whisper-large-v3-singlish + [DRAFT](https://huggingface.co/mjwong/whisper-large-v3-singlish-DRAFT) | 2.20          | 14.84%     |
||||||
| AMI             | Large                     | [Whisper-large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish)               | 1.00          | 23.72%     |
| AMI             | Large-Turbo               | [Whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish)   | 1.53          | **16.99%** |
| AMI             | Draft-enhanced Large      | Whisper-large-v3-singlish + [DRAFT](https://huggingface.co/mjwong/whisper-large-v3-singlish-DRAFT) | **2.27**      | 22.06%     |
||||||
| GigaSpeech      | Large                     | [Whisper-large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish)               | 1.00          | 13.15%     |
| GigaSpeech      | Large-Turbo               | [Whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish)   | 1.95          | **11.54%** |
| GigaSpeech      | Draft-enhanced Large      | Whisper-large-v3-singlish + [DRAFT](https://huggingface.co/mjwong/whisper-large-v3-singlish-DRAFT) | **2.37**      | 12.81%     |

### Speculative Acceptance Rates (DRAFT-enhanced Large Model)

| **Dataset**    | **Micro Avg Acceptance** | **Macro Avg Acceptance**  |
|----------------|--------------------------|---------------------------|
| SASRBench-v1   | 38.00%                   | 42.00%                    |
| AMI            | 38.00%                   | 43.00%                    |
| GigaSpeech     | 31.00%                   | 37.00%                    |
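
Under the usual convention, the micro average pools accepted and proposed draft tokens over the whole dataset, while the macro average is the mean of per-utterance acceptance rates. A minimal sketch of this computation, assuming per-utterance (accepted, proposed) token counts are available from the speculative decoding run:

```python
def acceptance_rates(counts: list[tuple[int, int]]) -> tuple[float, float]:
    accepted = sum(a for a, _ in counts)
    proposed = sum(p for _, p in counts)
    micro = accepted / proposed                          # pooled over all tokens
    macro = sum(a / p for a, p in counts) / len(counts)  # mean of per-utterance rates
    return micro, macro
```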

### Conclusion

While it does not match Large-Turbo's WER, the Draft-enhanced Large model achieves strong speculative acceptance rates (~31–43%), which translate into meaningful runtime gains through early acceptance of draft tokens. For latency-sensitive applications, it offers a compelling middle ground: throughput comparable to Large-Turbo, with accuracy between the standard Large model and Large-Turbo.

## Disclaimer

While this model has been fine-tuned to better recognize Singlish, users may experience inaccuracies, biases, or unexpected outputs, particularly in challenging audio conditions or with speakers using non-standard variations. Use of this model is at your own risk; the developers and distributors are not liable for any consequences arising from its use. Please validate results before deploying in any sensitive or production environment.

## How to use the model

Whisper-large-v3-singlish-DRAFT can be leveraged as an assistant model in a speculative decoding setup with Whisper-large-v3-singlish as the target. The assistant model proposes initial tokens, which are selectively verified by the target model to accelerate inference without sacrificing accuracy.

```python
import torch
from transformers import (
    pipeline,
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor
)

TARGET_REPO_NAME = "mjwong/whisper-large-v3-singlish"
DRAFT_REPO_NAME = "mjwong/whisper-large-v3-singlish-DRAFT"

# Select appropriate device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the draft model (used as the assistant in speculative decoding)
assistant_model = AutoModelForCausalLM.from_pretrained(
    DRAFT_REPO_NAME, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

# Load the main target model (the high-accuracy decoder)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    TARGET_REPO_NAME, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# Load processor (tokenizer + feature extractor)
processor = AutoProcessor.from_pretrained(TARGET_REPO_NAME)

# Create the ASR pipeline with speculative decoding
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)
```

You can then use the pipeline to transcribe audio files of arbitrary length.

```python
from datasets import load_dataset

# Load a sample clip from the Singlish ASR benchmark
dataset = load_dataset("mjwong/SASRBench-v1", split="test")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
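
To reproduce WER-style scores with this pipeline, a minimal evaluation sketch, assuming the `evaluate` library and that SASRBench-v1 stores reference transcripts in a `text` column:

```python
import evaluate
from datasets import load_dataset

wer_metric = evaluate.load("wer")
ds = load_dataset("mjwong/SASRBench-v1", split="test")

# `pipe` is the speculative-decoding pipeline constructed above
predictions = [pipe(x["audio"])["text"] for x in ds]
references = [x["text"] for x in ds]  # assumed transcript column name

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```

Note that reported WER typically depends on the text normalization applied before scoring, which is not shown here.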

## Contact

For more information, please reach out to [email protected].