---
license: apache-2.0
base_model: cahya/whisper-medium-id
tags:
- automatic-speech-recognition
- audio
- whisper
- onnx
- quantized
- indonesian
- speech-to-text
language:
- id
datasets:
- indonesian-speech
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
model-index:
- name: cahya-whisper-medium-onnx
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Indonesian Speech Test Set
      type: indonesian-speech
    metrics:
    - name: Word Error Rate
      type: wer
      value: 0.048
    - name: Character Error Rate
      type: cer
      value: 0.025
inference:
  parameters:
    max_new_tokens: 128
    language: id
    task: transcribe
widget:
- example_title: "Indonesian Speech Example"
  src: https://huggingface.co/datasets/indonesian-speech/resolve/main/sample.wav
---

# Cahya Whisper Medium ONNX

ONNX-optimized version of the Cahya Whisper Medium model for Indonesian speech recognition.

## Model Description

This repository contains the quantized ONNX version of the `cahya/whisper-medium-id` model, optimized for faster inference while maintaining transcription quality for Indonesian speech.

## Model Files

- `encoder_model_quantized.onnx` - Quantized encoder model (313 MB)
- `decoder_model_quantized.onnx` - Quantized decoder model (512 MB) 
- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `example.py` - Usage example script

## Performance Characteristics

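- **Model Size**: ~825 MB (vs. ~1 GB for the original)
- **Inference Speed**: 20-40% faster than the original (21% in the CPU benchmark below)
- **Memory Usage**: 15-30% lower memory consumption (about 14% in the CPU benchmark below)
- **Quality**: Minimal degradation in transcription accuracy (WER 0.048 vs. 0.045 for the original)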
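
To check the quality claim on your own recordings, word error rate (WER) can be computed as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch using only the standard library; the function name `word_error_rate` is hypothetical and is not part of `example.py`, and it does no text normalization (casing, punctuation), which a real evaluation would usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("saya pergi ke pasar", "saya pergi pasar")` counts one deleted word out of four reference words.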

## Installation

```bash
pip install -r requirements.txt
```

## Usage

### Basic Example

```python
from example import CahyaWhisperONNX

# Initialize model
model = CahyaWhisperONNX("./")

# Transcribe audio file
transcription = model.transcribe("audio.wav")
print(transcription)
```

### Command Line Usage

```bash
python example.py --audio path/to/audio.wav
```

### Advanced Usage

```python
import librosa
from example import CahyaWhisperONNX

# Initialize model
model = CahyaWhisperONNX("./")

# Load audio manually
audio, sr = librosa.load("audio.wav", sr=16000)

# Transcribe with custom parameters
transcription = model.transcribe(audio, max_new_tokens=256)
print(f"Transcription: {transcription}")

# Get model information
info = model.get_model_info()
print(f"Model size: {info['encoder_file_size'] + info['decoder_file_size']:.1f} MB")
```

## Supported Audio Formats

- WAV, MP3, M4A, FLAC
- Recommended: 16kHz sample rate
- Maximum duration: 30 seconds (configurable)
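
Before transcribing, it can help to confirm that a file matches these recommendations. The sketch below uses Python's standard `wave` module, so it only handles PCM WAV files; MP3, M4A, and FLAC would need a loader such as `librosa`. The helper names `inspect_wav` and `check_wav` are hypothetical and not part of `example.py`.

```python
import wave

TARGET_SAMPLE_RATE = 16000  # recommended sample rate, in Hz
MAX_DURATION_S = 30.0       # default maximum duration, in seconds


def inspect_wav(path):
    """Return (sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return sample_rate, duration


def check_wav(path):
    """Return a list of warnings; the list is empty if the file fits the recommendations."""
    sample_rate, duration = inspect_wav(path)
    warnings = []
    if sample_rate != TARGET_SAMPLE_RATE:
        warnings.append(
            f"sample rate is {sample_rate} Hz, expected {TARGET_SAMPLE_RATE} Hz"
        )
    if duration > MAX_DURATION_S:
        warnings.append(
            f"duration is {duration:.1f} s, longer than {MAX_DURATION_S:.0f} s"
        )
    return warnings
```

A file that produces a sample-rate warning can still be used by loading it with `librosa.load(path, sr=16000)`, which resamples on load.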

## Requirements

- Python 3.8+
- onnxruntime >= 1.16.0
- transformers >= 4.35.0
- librosa >= 0.10.0

## Model Details

| Parameter | Value |
|-----------|--------|
| Architecture | Whisper Medium |
| Language | Indonesian (ID) |
| Parameters | ~769M |
| Quantization | INT8 |
| Sample Rate | 16kHz |
| Context Length | 30s |

## Benchmark Results

Performance comparison with original `cahya/whisper-medium-id`:

| Metric | Original | ONNX Quantized | Improvement |
|--------|----------|----------------|-------------|
| Model Size | 1024 MB | 825 MB | 19% smaller |
| Inference Time | 2.34s | 1.86s | 21% faster |
| Memory Usage | 45.2 MB | 38.7 MB | 14% lower |
| WER | 0.045 | 0.048 | ~7% higher (slightly worse) |

*Benchmarked on CPU with typical Indonesian speech samples*
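
The percentages in the last column are relative changes with respect to the original model. As a quick worked check, the sketch below recomputes them from the two value columns; the function name `relative_reduction` is illustrative only.

```python
def relative_reduction(original: float, quantized: float) -> float:
    """Percentage by which `quantized` is lower than `original` (negative if higher)."""
    return (original - quantized) / original * 100.0


# (original, ONNX quantized) pairs copied from the table above
benchmarks = {
    "Model Size (MB)": (1024, 825),
    "Inference Time (s)": (2.34, 1.86),
    "Memory Usage (MB)": (45.2, 38.7),
    "WER": (0.045, 0.048),
}

for metric, (original, quantized) in benchmarks.items():
    print(f"{metric}: {relative_reduction(original, quantized):+.1f}%")
```

A positive value means the quantized model is lower (better for size, time, memory, and WER alike); the WER row comes out negative because the quantized model's error rate is slightly higher.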

## Limitations

1. **Quantization Effects**: Slight quality degradation compared to original
2. **Hardware Compatibility**: Some quantized operations may not work on all hardware
3. **Language Support**: Optimized specifically for Indonesian language
4. **Context Window**: Limited to 30-second audio segments
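
One common way to work around the 30-second limit is to split longer recordings into consecutive segments and transcribe each one separately. The following is a minimal sketch under that assumption; `split_into_segments` is a hypothetical helper, not part of `example.py`, and it cuts at fixed positions, so words that straddle a boundary may be transcribed poorly.

```python
SAMPLE_RATE = 16000   # Hz, matches the model's expected input rate
SEGMENT_SECONDS = 30  # context window of the model


def split_into_segments(samples, sample_rate=SAMPLE_RATE,
                        segment_seconds=SEGMENT_SECONDS):
    """Yield consecutive slices of at most `segment_seconds` of audio.

    `samples` can be any sliceable sequence with a length, such as a
    list or the NumPy array returned by `librosa.load`.
    """
    segment_len = int(sample_rate * segment_seconds)
    if segment_len <= 0:
        raise ValueError("segment length must be positive")
    for start in range(0, len(samples), segment_len):
        yield samples[start:start + segment_len]
```

Each segment could then be passed to `model.transcribe(...)` as in the Advanced Usage example, with the partial transcriptions joined by spaces.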

## Troubleshooting

### Common Issues

**"Could not find an implementation for ConvInteger" Error**
- This indicates missing quantization operator support
- Try updating onnxruntime: `pip install -U onnxruntime`
- Consider using onnxruntime-gpu if available

**Out of Memory Error**
- Reduce audio length to under 30 seconds
- Use the CPU execution provider by passing `providers=['CPUExecutionProvider']` when the ONNX Runtime session is created
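
How `example.py` creates its ONNX Runtime sessions is not shown in this README, so the following is only a standalone sketch of explicit provider selection. `choose_providers` and `open_session` are hypothetical helpers; `onnxruntime.get_available_providers()` and the `providers` argument of `onnxruntime.InferenceSession` are part of the ONNX Runtime Python API.

```python
def choose_providers(available, prefer_gpu=False):
    """Pick execution providers in priority order, always ending with the CPU fallback."""
    chosen = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        chosen.append("CUDAExecutionProvider")
    chosen.append("CPUExecutionProvider")
    return chosen


def open_session(model_path, prefer_gpu=False):
    """Create an ONNX Runtime session for `model_path` with explicit providers."""
    # Imported lazily so choose_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers(), prefer_gpu)
    return ort.InferenceSession(model_path, providers=providers)
```

For instance, `open_session("encoder_model_quantized.onnx")` would load the encoder on the CPU only, which avoids GPU memory pressure at the cost of speed.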

**Poor Transcription Quality**
- Ensure audio is 16kHz sample rate
- Check audio quality and volume
- Try preprocessing audio (noise reduction, normalization)

### Performance Tips

1. **Faster Inference**:
   - Use shorter audio clips
   - Reduce `max_new_tokens` parameter
   - Use GPU if available with `onnxruntime-gpu`

2. **Better Quality**:
   - Preprocess audio (normalize volume, reduce noise)
   - Use high-quality audio sources
   - Ensure clear speech without background noise
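
As an illustration of the volume normalization mentioned above, the sketch below applies simple peak normalization to a sequence of float samples. The helper name `peak_normalize` and the default `target_peak=0.95` are assumptions chosen for the example, not values taken from `example.py`; noise reduction is a separate step and is not covered here.

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the largest absolute value equals `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

With the NumPy arrays returned by `librosa.load`, the same idea can be written in vectorized form as `audio * (target_peak / np.abs(audio).max())`, guarding against an all-zero signal in the same way.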

## Citation

```bibtex
@misc{cahya-whisper-medium-onnx,
  title={Cahya Whisper Medium ONNX},
  author={Indonesian Speech Recognition Community},
  year={2024},
  url={https://huggingface.co/asmud/cahya-whisper-medium-onnx}
}
```

## License

Apache 2.0, the same license as the original Cahya Whisper model.

## Related Models

- Original: [cahya/whisper-medium-id](https://huggingface.co/cahya/whisper-medium-id)
- Base model: [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)