Pathumma Whisper Large V3 (TH) - Natural Noise-Robust Finetuned (v4, LoRA)
Model Description
This model is a Thai Automatic Speech Recognition (ASR) system based on nectec/Pathumma-whisper-th-large-v3, enhanced with LoRA (Low-Rank Adaptation) fine-tuning to improve robustness in noisy environments. It uses WhisperForConditionalGeneration with SpecAugment and gradient checkpointing to improve performance on real-world noisy and spontaneous Thai speech. Training was done on a custom dataset simulating voice messages, ambient sound, and conversational noise.
Dataset
- Name: tingwry/asr-augmented
- Description: Thai ASR dataset augmented with realistic background noise (e.g., voice messages, ambient environments) to simulate common recording conditions (see the loading sketch below).
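A minimal loading sketch with the datasets library follows; the split name and the audio/text column names are assumptions, so check the dataset card for the actual schema.

```python
from datasets import Audio, load_dataset

# Assumed split and column names ("train", "audio", "text"); verify against the dataset card.
ds = load_dataset("tingwry/asr-augmented", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper models expect 16 kHz input

sample = ds[0]
print(sample["audio"]["array"].shape)
print(sample["text"])
```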
Quickstart
```python
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

lang = "th"
task = "transcribe"

pipe = pipeline(
    task="automatic-speech-recognition",
    model="PogusTheWhisper/Pathumma-whisper-th-large-v3-natural-noise-finetuned",
    device=device,
    torch_dtype=torch_dtype,
    chunk_length_s=30,
    return_timestamps=False,
)

# Force Thai transcription instead of relying on automatic language detection
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task=task)

audio_path = "<Your wav file>"
result = pipe(audio_path)
print("Full Transcription:\n", result["text"])
```
Model Architecture
- Base Model: nectec/Pathumma-whisper-th-large-v3
- Adapter Type: LoRA
- Target Modules: q_proj, k_proj, v_proj
- LoRA Config: r=8, lora_alpha=32, lora_dropout=0.1
- SpecAugment: mask_time_prob=0.2, mask_feature_prob=0.2 (a configuration sketch follows below)
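The original training script is not included in this card, so the snippet below only reconstructs the configuration listed above with peft and transformers; treat it as an illustrative sketch rather than the authors' code.

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("nectec/Pathumma-whisper-th-large-v3")

# SpecAugment on the encoder input features, using the probabilities listed above
base.config.apply_spec_augment = True
base.config.mask_time_prob = 0.2
base.config.mask_feature_prob = 0.2

# LoRA adapters on the attention projections listed above
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```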
Training Arguments
- Epochs: 8
- Learning Rate: 2e-5
- Scheduler: Cosine
- Warmup Ratio: 0.05
- Batch Size: 4 (per device)
- Precision: bf16
- Optimizer: AdamW (fused)
- Gradient Checkpointing: Enabled
- Metric: CER
- Generation Max Length: 256
- Generation Beams: 5 (a Seq2SeqTrainingArguments sketch follows below)
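A Seq2SeqTrainingArguments sketch matching the hyperparameters above is shown below; the output directory and the per-epoch evaluation/save cadence are assumptions, since the original script is not published.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./pathumma-whisper-noise-lora",  # hypothetical path
    num_train_epochs=8,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    per_device_train_batch_size=4,
    bf16=True,
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=256,
    generation_num_beams=5,
    eval_strategy="epoch",   # assumed cadence
    save_strategy="epoch",   # assumed cadence
    metric_for_best_model="cer",
    greater_is_better=False,
    load_best_model_at_end=True,
)
```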
Training Results
Epoch | Training Loss | Validation Loss | CER | WER |
---|---|---|---|---|
1 | 0.049300 | 0.022428 | 0.052511 | 0.124607 |
2 | 0.017500 | 0.015223 | 0.051452 | 0.100236 |
3 | 0.012900 | 0.012217 | 0.049419 | 0.092767 |
4 | 0.009900 | 0.010561 | 0.049024 | 0.091588 |
5 | 0.007500 | 0.010173 | 0.048868 | 0.087657 |
6 | 0.007200 | 0.009647 | 0.050930 | 0.086478 |
7 | 0.006700 | 0.009532 | 0.051565 | 0.087264 |
8 | 0.006400 | 0.009492 | 0.047598 | 0.086478 |
Evaluation Performance (%)
CER
model | samples | SEACrowd/gowajee | SEACrowd/thai_elderly_speech | fsicoli/common_voice_18_0 | google/fleurs | tingwry/asr-augmented |
---|---|---|---|---|---|---|
whisper-large-v3 | 388 | 37.82 | 5.24 | 9.25 | 10.95 | 4.58 |
pathumma-whisper-th-large-v3-natural-noise-finetuned | 388 | 2.18 | 0.84 | 4.73 | 7.21 | 1.3 |
airesearch-wav2vec2-large-xlsr-53-th | 388 | 30.31 | 3.83 | 6.49 | 12.84 | 8.19 |
pathumma-whisper-th-large-v3 | 388 | 1.27 | 0.5 | 4.75 | 7.39 | 4.57 |
monsoon-whisper-medium-gigaspeech2 | 388 | 30.31 | 3.83 | 6.49 | 12.84 | 8.19 |
thonburian-whisper-th-large-v3-combined | 388 | 8.61 | 0.81 | 5.8 | 7.45 | 2.71 |
WER
model | samples | SEACrowd/gowajee | SEACrowd/thai_elderly_speech | fsicoli/common_voice_18_0 | google/fleurs | tingwry/asr-augmented |
---|---|---|---|---|---|---|
whisper-large-v3 | 388 | 94.1 | 96.91 | 78.84 | 87.97 | 74.12 |
pathumma-whisper-th-large-v3-natural-noise-finetuned | 388 | 8.23 | 19.33 | 69.1 | 69.39 | 7.15 |
airesearch-wav2vec2-large-xlsr-53-th | 388 | 99.58 | 38.92 | 67.79 | 99.63 | 100 |
pathumma-whisper-th-large-v3 | 388 | 4.37 | 5.41 | 80.34 | 71.13 | 90.02 |
monsoon-whisper-medium-gigaspeech2 | 388 | 99.58 | 38.92 | 67.79 | 99.63 | 100 |
thonburian-whisper-th-large-v3-combined | 388 | 39.84 | 11.08 | 110.67 | 66.33 | 49.85 |
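The exact evaluation script is not included here, but CER and WER percentages like those above can be computed with the evaluate library (which wraps jiwer); the transcripts below are placeholders, and note that WER on Thai depends on how words are segmented, as noted in the Limitations section below.

```python
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

# Placeholder transcripts; substitute model outputs and references from each benchmark
predictions = ["สวัสดีครับ วันนี้อากาศดี"]
references = ["สวัสดีครับ วันนี้อากาศดี"]

print("CER (%):", 100 * cer_metric.compute(predictions=predictions, references=references))
print("WER (%):", 100 * wer_metric.compute(predictions=predictions, references=references))
```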
Limitations and Future Work
- Trained on Thai speech only; not multilingual
- Evaluated using CER and WER; Thai word-segmentation-aware metrics will be explored in future versions
- May not generalize well to regional dialects or highly degraded audio
- Future improvements may include domain adaptation (e.g., medical, legal) and dialect-specific tuning
Acknowledgements
- NECTEC for the original base model
- OpenAI for Whisper architecture
- SuperAI Engineer Program for mentor support
- ThaiSC (NSTDA Supercomputer Center) for GPU compute on LANTA cluster
- Special thanks to P'Tik, P'Joe, P'Sam, P'nut and P'Earth
- And The Scamper SS5 House
Built with peft==0.15.2 and transformers==4.x
Model tree for PogusTheWhisper/Pathumma-whisper-th-large-v3-natural-noise-finetuned
- Base model: openai/whisper-large-v3
- Finetuned from: nectec/Pathumma-whisper-th-large-v3