---
language:
- en
- hi
tags:
- audio
- automatic-speech-recognition
- whisper-event
- pytorch
- hinglish
inference: true
model-index:
- name: Whisper-Hindi2Hinglish-Prime
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs
      type: google/fleurs
      config: hi_in
      split: test
    metrics:
    - type: wer
      value: 28.6806
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_20_0
      type: mozilla-foundation/common_voice_20_0
      config: hi
      split: test
    metrics:
    - type: wer
      value: 32.4314
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Indic-Voices
      type: Indic-Voices
      config: hi
      split: test
    metrics:
    - type: wer
      value: 60.8224
      name: WER
widget:
- src: audios/c0637211-7384-4abc-af69-5aacf7549824_1_2629072_2656224.wav
  output:
    text: Mehnat to poora karte hain.
- src: audios/c0faba11-27ba-4837-a2eb-ccd67be07f40_1_3185088_3227568.wav
  output:
    text: Haan vahi ek aapko bataaya na.
- src: audios/663eb653-d6b5-4fda-b5f2-9ef98adc0a61_0_1098400_1118688.wav
  output:
    text: Aap pandrah log hain.
- src: audios/f5e0178c-354c-40c9-b3a7-687c86240a77_1_2613728_2630112.wav
  output:
    text: Kitne saal ki?
- src: audios/f5e0178c-354c-40c9-b3a7-687c86240a77_1_1152496_1175488.wav
  output:
    text: Lander cycle chaahie.
- src: audios/c0637211-7384-4abc-af69-5aacf7549824_1_2417088_2444224.wav
  output:
    text: Haan haan, dekhe hain.
- src: audios/common_voice_hi_23796065.mp3
  example_title: Speech Example 1
- src: audios/common_voice_hi_41666099.mp3
  example_title: Speech Example 2
- src: audios/common_voice_hi_41429198.mp3
  example_title: Speech Example 3
- src: audios/common_voice_hi_41429259.mp3
  example_title: Speech Example 4
- src: audios/common_voice_hi_40904697.mp3
  example_title: Speech Example 5
pipeline_tag: automatic-speech-recognition
license: apache-2.0
metrics:
- wer
base_model:
- openai/whisper-large-v3
library_name: transformers
---

## Whisper-Hindi2Hinglish-Prime:

### Table of Contents:
- [Key Features](#key-features)
- [Training](#training)
    - [Data](#data)
    - [Finetuning](#finetuning)
- [Usage](#usage)
- [Performance Overview](#performance-overview)
  - [Qualitative Performance Overview](#qualitative-performance-overview)
  - [Quantitative Performance Overview](#quantitative-performance-overview)
- [Miscellaneous](#miscellaneous) 

### Key Features:
1. **Hinglish as a language**: Adds the ability to transcribe audio into spoken Hinglish, reducing the chances of grammatical errors
2. **Whisper Architecture**: Based on the Whisper architecture, making it easy to use with the transformers package
3. **Better Noise Handling**: The model is resistant to noise and thus does not return transcriptions for audio containing only noise
4. **Hallucination Mitigation**: Minimizes transcription hallucinations to enhance accuracy
5. **Performance Increase**: ~39% average performance increase versus the pretrained model across benchmarking datasets

### Training:
#### Data:
- **Duration**: A total of ~550 hours of noisy Indian-accented Hindi data was used to finetune the model.
- **Collection**: Due to the lack of ASR-ready Hinglish datasets, a specially curated proprietary dataset was used.
- **Labelling**: The data was labelled using a SOTA model, and the transcriptions were then improved through human intervention.
- **Quality**: Emphasis was placed on collecting noisy data, as the intended use case for the model is Indian environments where background noise is abundant.
- **Processing**: All audio was chunked into segments shorter than 30s with at most 2 speakers per clip (a minimal chunking sketch follows this list). No further processing was done, so as not to alter the quality of the source data.
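
The exact chunking pipeline isn't published; the sketch below shows one straightforward way to split a recording into sub-30s segments using `soundfile` (the helper name and output paths are illustrative assumptions, not part of the original pipeline):

```python
import soundfile as sf

def chunk_audio(path, max_seconds=30):
    """Split an audio file into segments shorter than max_seconds (illustrative only)."""
    audio, sr = sf.read(path)                      # waveform and sample rate
    step = max_seconds * sr                        # chunk length in samples
    for i, start in enumerate(range(0, len(audio), step)):
        # Write each sub-30s segment to its own file
        sf.write(f"chunk_{i:04d}.wav", audio[start:start + step], sr)

chunk_audio("long_recording.wav")
```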

#### Finetuning:
- **Novel Trainer Architecture**: A custom trainer was written to ensure efficient supervised finetuning, with custom callbacks to enable higher observability during the training process.
- **Custom Dynamic Layer Freezing**: The most active layers in the model were identified by running inference on a subset of the training data with the pretrained model. These layers were kept unfrozen during training while all other layers remained frozen, enabling faster convergence and efficient finetuning (see the sketch after this list).
- **Deepspeed Integration**: Deepspeed was also utilized to speed up and optimize the training process.
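
The exact selection criterion for the "most active" layers isn't published. As a rough illustration, the sketch below freezes every parameter of a Whisper checkpoint and then unfreezes a hypothetical set of decoder layers; the indices are placeholders, not the ones used for this model:

```python
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")

# Hypothetical indices of the "most active" decoder layers (placeholders only)
active_decoder_layers = {28, 29, 30, 31}

# Freeze everything first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the selected decoder layers
for idx, layer in enumerate(model.model.decoder.layers):
    if idx in active_decoder_layers:
        for param in layer.parameters():
            param.requires_grad = True
```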

### Performance Overview

#### Qualitative Performance Overview
| Audio | Whisper Large V3 | Whisper-Hindi2Hinglish-Prime |
|-------|------------------|------------------------------|
| <audio controls><source src="https://huggingface.co/Oriserve/Whisper-Hindi2Hinglish-Prime/resolve/main/audios/c0637211-7384-4abc-af69-5aacf7549824_1_2629072_2656224.wav" type="audio/wav"></audio> | maynata pura, canta maynata | Mehnat to poora karte hain. |
| <audio controls><source src="https://huggingface.co/Oriserve/Whisper-Hindi2Hinglish-Prime/resolve/main/audios/c0faba11-27ba-4837-a2eb-ccd67be07f40_1_3185088_3227568.wav" type="audio/wav"></audio> | Where did they come from? | Haan vahi ek aapko bataaya na. |
| <audio controls><source src="https://huggingface.co/Oriserve/Whisper-Hindi2Hinglish-Prime/resolve/main/audios/663eb653-d6b5-4fda-b5f2-9ef98adc0a61_0_1098400_1118688.wav" type="audio/wav"></audio> | A Pantral Logan. | Aap pandrah log hain. |
| <audio controls><source src="https://huggingface.co/Oriserve/Whisper-Hindi2Hinglish-Prime/resolve/main/audios/f5e0178c-354c-40c9-b3a7-687c86240a77_1_2613728_2630112.wav" type="audio/wav"></audio> | Thank you, Sanchez. | Kitne saal ki? |
| <audio controls><source src="https://huggingface.co/Oriserve/Whisper-Hindi2Hinglish-Prime/resolve/main/audios/f5e0178c-354c-40c9-b3a7-687c86240a77_1_1152496_1175488.wav" type="audio/wav"></audio> | Rangers, I can tell you. | Lander cycle chaahie. |
| <audio controls><source src="https://huggingface.co/Oriserve/Whisper-Hindi2Hinglish-Prime/resolve/main/audios/c0637211-7384-4abc-af69-5aacf7549824_1_2417088_2444224.wav" type="audio/wav"></audio> | Uh-huh. They can't. | Haan haan, dekhe hain. |


#### Quantitative Performance Overview

***Note***:
- *The WER scores below are for Hinglish text generated by our model and the original Whisper model*
- *To check our model's real-world performance against other SOTA models, please head to our [Speech-To-Text Arena](https://huggingface.co/spaces/Oriserve/ASR_arena) space.*

| Dataset | Whisper Large V3 | Whisper-Hindi2Hinglish-Prime |
|-------|------------------------|-------------------------|
| [Common-Voice](https://commonvoice.mozilla.org/en) | 61.9432| 32.4314 |
| [FLEURS](https://huggingface.co/datasets/google/fleurs) | 50.8425 | 28.6806 |
| [Indic-Voices](https://ai4bharat.iitm.ac.in/datasets/indicvoices)| 82.5621 | 60.8224 |
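
The evaluation script itself isn't included in the card. Assuming a standard setup, WER numbers like those above can be computed with the `evaluate` library; the prediction and reference lists below are placeholders:

```python
import evaluate

wer_metric = evaluate.load("wer")

# Placeholder lists: model transcriptions vs. ground-truth Hinglish references
predictions = ["Aap pandrah log hain.", "Kitne saal ki?"]
references  = ["Aap pandrah log hain.", "Kitne saal ke?"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.4f}")
```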

### Usage:
#### Using Transformers
- To run the model, first install the Transformers library

```pip install -U transformers```

- The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio of arbitrary length:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Prime"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype,        # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,         # Optimize memory usage during loading
    use_safetensors=True            # Use safetensors format for better security
)
model.to(device)                    # Move model to specified device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",       # Set task to transcription
        "language": "en"            # "en" token keeps the Hinglish output in Latin script
    }
)

# Process audio file and print transcription
sample = "sample.wav"               # Input audio file path
result = pipe(sample)               # Run inference
print(result["text"])               # Print transcribed text
```
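
For long recordings, the same pipeline can also run chunked, batched inference. `chunk_length_s` and `batch_size` are standard `pipeline` arguments; the values below are illustrative, not tuned recommendations:

```python
# Chunked, batched inference for long audio (illustrative values)
result = pipe(sample, chunk_length_s=30, batch_size=8)
print(result["text"])
```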

#### Using Flash Attention 2

Flash Attention 2 can be used to speed up transcription. If your GPU supports Flash Attention, first install it:

```pip install flash-attn --no-build-isolation```

- Once installed, you can load the model using the code below:

```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2"  # Enable Flash Attention 2
)
```
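
If you're unsure whether Flash Attention 2 is installed, transformers exposes a helper to check at runtime; the SDPA fallback below is a suggestion, not part of the original card:

```python
from transformers.utils import is_flash_attn_2_available

# Use Flash Attention 2 when available, otherwise fall back to PyTorch SDPA
attn_impl = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    attn_implementation=attn_impl
)
```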

#### Using the OpenAI Whisper module

- First, install the openai-whisper library

```pip install -U openai-whisper tqdm```

- Convert the Hugging Face checkpoint to an OpenAI-Whisper-compatible PyTorch checkpoint

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq
import re
from tqdm import tqdm
from collections import OrderedDict
import json

# Load parameter name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
    reverse_translation = json.load(f)

reverse_translation = OrderedDict(reverse_translation)

def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)

    # Extract model dimensions from config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,           # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,            # Vocabulary size
        "n_audio_ctx": config.max_source_positions,    # Max audio context length
        "n_audio_state": config.d_model,         # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,   # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,     # Max text context length
        "n_text_state": config.d_model,          # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,  # Text decoder attention heads
        "n_text_layer": config.decoder_layers,    # Number of text decoder layers
    }

    # Convert model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}

    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")          # Remove 'model.' prefix
        new_key = reverse_translate(key)         # Convert parameter names
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save converted model
    torch.save(pytorch_model, save_path)

# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Prime"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    low_cpu_mem_usage=True,        # Optimize memory usage
    use_safetensors=True           # Use safetensors format
)

# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Prime.pt"
save_model(model, model_save_path)
```

- Transcribe

```python
import whisper
# Load converted model with Whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Prime.pt")
result = model.transcribe("sample.wav")
print(result["text"])
```
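
`whisper.transcribe` also accepts decoding options as keyword arguments; for example, pinning the language token mirrors the `generate_kwargs` used in the pipeline example above (this particular call is a suggestion, not from the original card):

```python
# Pin the language token, mirroring the pipeline example above
result = model.transcribe("sample.wav", language="en", task="transcribe")
print(result["text"])
```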


### Miscellaneous
This model is from a family of transformers-based ASR models trained by Oriserve. To compare this model against other models from the same family, or against other SOTA models, please head to our [Speech-To-Text Arena](https://huggingface.co/spaces/Oriserve/ASR_arena). To learn more about our other models, or for any other queries regarding AI voice agents, reach out to us by email at [[email protected]]([email protected]).