---
license: mit
language:
- en
base_model:
- distil-whisper/distil-large-v3.5
pipeline_tag: automatic-speech-recognition
library_name: transformers
model-index:
- name: whisper-large-v3-singlish-DRAFT
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SASRBench-v1
      type: mjwong/SASRBench-v1
      split: test
    metrics:
    - name: WER
      type: WER
      value: 14.89
tags:
- whisper
---

# Whisper large-v3-singlish-DRAFT

**Whisper large-v3-singlish-DRAFT** is a fine-tuned automatic speech recognition (ASR) model tailored specifically for Singlish. Based on OpenAI’s Whisper architecture, this model has been optimized with Singlish-centric data to better capture the distinctive phonetic patterns and vocabulary commonly found in Singlish speech. It is designed to serve as a lightweight draft model in speculative decoding pipelines, working in tandem with [Whisper large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish) as the target model to improve transcription speed while maintaining accuracy.

## Model Details

- **Developed by:** Ming Jie Wong
- **Base Model:** [distil-whisper/distil-large-v3.5](https://huggingface.co/distil-whisper/distil-large-v3.5)
- **Model Type:** Encoder-decoder
- **Metrics:** Word Error Rate (WER)
- **Languages Supported:** English (with a focus on Singlish)
- **License:** MIT

### Description

Whisper-large-v3-singlish-DRAFT was trained on pseudo-labels generated by its target model, Whisper-large-v3-singlish. The target model transcribed 66.9k audio recordings sourced from Part 3 of the Same Room Environment Close-talk Microphone section of [IMDA’s National Speech Corpus (NSC)](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus). This self-distillation approach ensures close alignment between the draft and target models, enabling effective speculative decoding on Singlish speech.
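
The pseudo-labelling pipeline itself is not part of this release; the snippet below is a minimal sketch of how such labels could be generated with the target model, assuming a standard `transformers` ASR pipeline and placeholder file paths (`audio_files`):

```python
import torch
from transformers import pipeline

# Minimal pseudo-labelling sketch (an assumption, not the released pipeline):
# transcribe raw clips with the target model and keep the hypotheses as
# training transcripts for the draft model.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
target = pipeline(
    "automatic-speech-recognition",
    model="mjwong/whisper-large-v3-singlish",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=device,
)

audio_files = ["clip_0001.wav", "clip_0002.wav"]  # placeholder paths
pseudo_labels = [target(path)["text"] for path in audio_files]
```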

The original Part 3 of the National Speech Corpus comprises approximately 1,000 hours of conversational speech from around 1,000 local English speakers, recorded in pairs. These conversations cover everyday topics and include interactive game-based dialogues. Recordings were conducted in two environments:
- Same Room, where speakers shared a room and were recorded using a close-talk mic and a boundary mic.
- Separate Room, where each speaker was recorded individually using a standing mic and a telephone (IVR).

Audio segments for the internal dataset were extracted using the following criteria (a filtering sketch follows the list):
- **Minimum Word Count:** 10 words

  _This threshold was chosen to ensure that each audio segment contains sufficient linguistic context for the model to better understand speech in Singlish. Shorter segments may bias the model towards specific utterances or phrases, limiting its overall comprehension._
- **Maximum Duration:** 20 seconds

  _This threshold was chosen to provide enough context for accurate transcription while minimizing noise and computational complexity for longer audio segments._
- **Sampling Rate:** All audio segments are down-sampled to 16 kHz.
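
As a rough illustration only, the thresholds above might be applied with the `datasets` library as follows; the column names and placeholder rows are assumptions, not the actual preprocessing code:

```python
from datasets import Dataset, Audio

MIN_WORDS = 10         # minimum word count per segment
MAX_DURATION_S = 20.0  # maximum segment duration in seconds

# Assumption: `rows` pairs local clip paths with their pseudo-label texts.
rows = {"audio": ["clip_0001.wav"], "text": ["placeholder transcript"]}

# Casting to Audio(sampling_rate=16_000) down-samples on the fly to 16 kHz.
dataset = Dataset.from_dict(rows).cast_column("audio", Audio(sampling_rate=16_000))

def keep_segment(example):
    """Keep segments that meet the word-count and duration thresholds above."""
    n_words = len(example["text"].split())
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return n_words >= MIN_WORDS and duration <= MAX_DURATION_S

dataset = dataset.filter(keep_segment)
```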

Full experiment details will be added soon.

### Fine-Tuning Details

Fine-tuning was performed on a single A100-80GB GPU.

#### Training Hyperparameters

The following hyperparameters were used (see the configuration sketch after this list):
- **batch_size:** 16
- **gradient_accumulation_steps:** 1
- **learning_rate:** 1e-6
- **warmup_steps:** 300
- **max_steps:** 5000
- **fp16:** true
- **eval_batch_size:** 16
- **eval_steps:** 300
- **max_grad_norm:** 1.0
- **generation_max_length:** 225
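
For reference, here is a hedged sketch of how these values could map onto `Seq2SeqTrainingArguments` in `transformers`; `output_dir` is a placeholder, and any argument not listed above is omitted:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mirrors the hyperparameters listed above. `output_dir` is a
# placeholder assumption; arguments not named in the list are left at defaults.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-singlish-DRAFT",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    warmup_steps=300,
    max_steps=5000,
    fp16=True,
    per_device_eval_batch_size=16,
    eval_steps=300,
    max_grad_norm=1.0,
    generation_max_length=225,
)
```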

### Benchmark Performance

We evaluated the speculative decoding setup for Whisper large-v3-singlish on [SASRBench-v1](https://huggingface.co/datasets/mjwong/SASRBench-v1), a benchmark dataset for evaluating ASR performance on Singlish:

#### Model Performance

| **Model** | **Rel. RTFx** | **WER** |
|-----------|---------------|---------|
| [Whisper-large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish) | 1.00 | 16.45% |
| [Whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish) | 2.36 | 13.35% |
| Whisper-large-v3-singlish + [DRAFT](https://huggingface.co/mjwong/whisper-large-v3-singlish-DRAFT) | 2.20 | 14.89% |
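
The WER figures can in principle be reproduced along the following lines; this sketch assumes the `evaluate` library, the `pipe` object constructed in the usage section below, and that SASRBench-v1 stores reference transcripts in a `text` column (an assumption about the schema):

```python
import evaluate
from datasets import load_dataset

# Sketch: score the pipeline with the `evaluate` WER metric. The reference
# column name ("text") is an assumption about the benchmark's schema.
wer_metric = evaluate.load("wer")
dataset = load_dataset("mjwong/SASRBench-v1", split="test")

predictions = [pipe(sample["audio"])["text"] for sample in dataset]
references = [sample["text"] for sample in dataset]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```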

#### Speculative Acceptance Rates

| **Speculative Setup** | **Micro Avg Acceptance** | **Macro Avg Acceptance** |
|-----------------------|--------------------------|--------------------------|
| Whisper-large-v3-singlish + [DRAFT](https://huggingface.co/mjwong/whisper-large-v3-singlish-DRAFT) | 38.00% | 42.00% |
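
Micro and macro averages aggregate acceptance differently: micro averaging pools accepted and proposed draft tokens over the whole benchmark, while macro averaging takes the unweighted mean of per-utterance acceptance rates. A minimal sketch with hypothetical per-utterance counts:

```python
# Hypothetical per-utterance counts of accepted vs. proposed draft tokens.
accepted = [30, 12, 45]
proposed = [60, 40, 90]

micro_avg = sum(accepted) / sum(proposed)  # pooled over all draft tokens
macro_avg = sum(a / p for a, p in zip(accepted, proposed)) / len(accepted)

print(f"micro: {micro_avg:.2%}, macro: {macro_avg:.2%}")
```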

## Disclaimer

While this model has been fine-tuned to better recognize Singlish, users may experience inaccuracies, biases, or unexpected outputs, particularly in challenging audio conditions or with speakers using non-standard variations. Use of this model is at your own risk; the developers and distributors are not liable for any consequences arising from its use. Please validate results before deploying it in any sensitive or production environment.

## How to use the model

Whisper-large-v3-singlish-DRAFT can be used as the assistant model in a speculative decoding setup with Whisper-large-v3-singlish as the target. The assistant model proposes candidate tokens, which are then verified by the target model, accelerating inference without sacrificing accuracy.

```python
import torch
from transformers import (
    pipeline,
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
)

TARGET_REPO_NAME = "mjwong/whisper-large-v3-singlish"
DRAFT_REPO_NAME = "mjwong/whisper-large-v3-singlish-DRAFT"

# Select appropriate device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the draft model (used as the assistant in speculative decoding)
assistant_model = AutoModelForCausalLM.from_pretrained(
    DRAFT_REPO_NAME, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

# Load the main target model (the high-accuracy decoder)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    TARGET_REPO_NAME, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# Load processor (tokenizer + feature extractor)
processor = AutoProcessor.from_pretrained(TARGET_REPO_NAME)

# Create the ASR pipeline with speculative decoding
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)
```

You can then use this pipeline to transcribe audio of arbitrary length.

```python
from datasets import load_dataset

# Load a sample from the Singlish ASR benchmark
dataset = load_dataset("mjwong/SASRBench-v1", split="test")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
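
The pipeline also accepts paths to local audio files (the filename below is a placeholder):

```python
# Placeholder path: any audio file readable by ffmpeg works; the pipeline
# resamples it to the model's expected 16 kHz internally.
result = pipe("path/to/audio.wav")
print(result["text"])
```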

## Contact

For more information, please reach out to [email protected].