Pathumma Whisper Large V3 (TH) - Natural Noise-Robust Finetuned (v4, LoRA)
Model Description
This model is a Thai Automatic Speech Recognition (ASR) system based on nectec/Pathumma-whisper-th-large-v3, enhanced with LoRA (Low-Rank Adaptation) fine-tuning to improve robustness in noisy environments. It uses WhisperForConditionalGeneration with SpecAugment and gradient checkpointing to improve performance on real-world noisy and spontaneous Thai speech. Training was done on a custom dataset simulating voice messages, ambient sound, and conversational noise.
Dataset
- Name: tingwry/asr-augmented
- Description: Thai ASR dataset augmented with realistic background noise (e.g., voice messages, ambient environments) to simulate common recording conditions (see the loading sketch below).
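A minimal loading sketch with the datasets library follows; the split name and the audio/text column names are assumptions, so check the dataset card for the actual schema.

```python
from datasets import Audio, load_dataset

# Assumed split and column names ("train", "audio", "text"); verify against the dataset card.
ds = load_dataset("tingwry/asr-augmented", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper models expect 16 kHz input

sample = ds[0]
print(sample["audio"]["array"].shape)
print(sample["text"])
```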
Quickstart
```python
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

lang = "th"
task = "transcribe"

pipe = pipeline(
    task="automatic-speech-recognition",
    model="PogusTheWhisper/Pathumma-whisper-th-large-v3-natural-noise-finetuned",
    device=device,
    torch_dtype=torch_dtype,
    chunk_length_s=30,
    return_timestamps=False,
)

# Force Thai transcription instead of relying on automatic language detection
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task=task)

audio_path = "<Your wav file>"
result = pipe(audio_path)
print("Full Transcription:\n", result["text"])
```
Model Architecture
- Base Model: nectec/Pathumma-whisper-th-large-v3
- Adapter Type: LoRA
- Target Modules: q_proj, k_proj, v_proj
- LoRA Config: r=8, lora_alpha=32, lora_dropout=0.1
- SpecAugment: mask_time_prob=0.2, mask_feature_prob=0.2 (a configuration sketch follows below)
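The original training script is not included in this card, so the snippet below only reconstructs the configuration listed above with peft and transformers; treat it as an illustrative sketch rather than the authors' code.

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("nectec/Pathumma-whisper-th-large-v3")

# SpecAugment on the encoder input features, using the probabilities listed above
base.config.apply_spec_augment = True
base.config.mask_time_prob = 0.2
base.config.mask_feature_prob = 0.2

# LoRA adapters on the attention projections listed above
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```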
Training Arguments
- Epochs: 8
- Learning Rate: 2e-5
- Scheduler: Cosine
- Warmup Ratio: 0.05
- Batch Size: 4 (per device)
- Precision: bf16
- Optimizer: AdamW (fused)
- Gradient Checkpointing: Enabled
- Metric: CER
- Generation Max Length: 256
- Generation Beams: 5 (a Seq2SeqTrainingArguments sketch follows below)
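A Seq2SeqTrainingArguments sketch matching the hyperparameters above is shown below; the output directory and the per-epoch evaluation/save cadence are assumptions, since the original script is not published.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./pathumma-whisper-noise-lora",  # hypothetical path
    num_train_epochs=8,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    per_device_train_batch_size=4,
    bf16=True,
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=256,
    generation_num_beams=5,
    eval_strategy="epoch",   # assumed cadence
    save_strategy="epoch",   # assumed cadence
    metric_for_best_model="cer",
    greater_is_better=False,
    load_best_model_at_end=True,
)
```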
Training Results
Epoch | Training Loss | Validation Loss | CER | WER |
---|---|---|---|---|
1 | 0.049300 | 0.022428 | 0.052511 | 0.124607 |
2 | 0.017500 | 0.015223 | 0.051452 | 0.100236 |
3 | 0.012900 | 0.012217 | 0.049419 | 0.092767 |
4 | 0.009900 | 0.010561 | 0.049024 | 0.091588 |
5 | 0.007500 | 0.010173 | 0.048868 | 0.087657 |
6 | 0.007200 | 0.009647 | 0.050930 | 0.086478 |
7 | 0.006700 | 0.009532 | 0.051565 | 0.087264 |
8 | 0.006400 | 0.009492 | 0.047598 | 0.086478 |
Evaluation Performance (%)
CER
model | samples | SEACrowd/gowajee | SEACrowd/thai_elderly_speech | fsicoli/common_voice_18_0 | google/fleurs | tingwry/asr-augmented |
---|---|---|---|---|---|---|
whisper-large-v3 | 388 | 37.82 | 5.24 | 9.25 | 10.95 | 4.58 |
pathumma-whisper-th-large-v3-natural-noise-finetuned | 388 | 2.18 | 0.84 | 4.73 | 7.21 | 1.3 |
airesearch-wav2vec2-large-xlsr-53-th | 388 | 30.31 | 3.83 | 6.49 | 12.84 | 8.19 |
pathumma-whisper-th-large-v3 | 388 | 1.27 | 0.5 | 4.75 | 7.39 | 4.57 |
monsoon-whisper-medium-gigaspeech2 | 388 | 30.31 | 3.83 | 6.49 | 12.84 | 8.19 |
thonburian-whisper-th-large-v3-combined | 388 | 8.61 | 0.81 | 5.8 | 7.45 | 2.71 |
WER
model | samples | SEACrowd/gowajee | SEACrowd/thai_elderly_speech | fsicoli/common_voice_18_0 | google/fleurs | tingwry/asr-augmented |
---|---|---|---|---|---|---|
whisper-large-v3 | 388 | 94.1 | 96.91 | 78.84 | 87.97 | 74.12 |
pathumma-whisper-th-large-v3-natural-noise-finetuned | 388 | 8.23 | 19.33 | 69.1 | 69.39 | 7.15 |
airesearch-wav2vec2-large-xlsr-53-th | 388 | 99.58 | 38.92 | 67.79 | 99.63 | 100 |
pathumma-whisper-th-large-v3 | 388 | 4.37 | 5.41 | 80.34 | 71.13 | 90.02 |
monsoon-whisper-medium-gigaspeech2 | 388 | 99.58 | 38.92 | 67.79 | 99.63 | 100 |
thonburian-whisper-th-large-v3-combined | 388 | 39.84 | 11.08 | 110.67 | 66.33 | 49.85 |
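The exact evaluation script is not included here, but CER and WER percentages like those above can be computed with the evaluate library (which wraps jiwer); the transcripts below are placeholders, and note that WER on Thai depends on how words are segmented, as noted in the Limitations section below.

```python
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

# Placeholder transcripts; substitute model outputs and references from each benchmark
predictions = ["สวัสดีครับ วันนี้อากาศดี"]
references = ["สวัสดีครับ วันนี้อากาศดี"]

print("CER (%):", 100 * cer_metric.compute(predictions=predictions, references=references))
print("WER (%):", 100 * wer_metric.compute(predictions=predictions, references=references))
```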
Limitations and Future Work
- Trained on Thai speech only; not multilingual
- Evaluated using CER and WER; Thai word-segmentation-aware metrics will be explored in future versions
- May not generalize well to regional dialects or highly degraded audio
- Future improvements may include domain adaptation (e.g., medical, legal) and dialect-specific tuning
Acknowledgements
- NECTEC for the original base model
- OpenAI for Whisper architecture
- SuperAI Engineer Program for mentor support
- ThaiSC (NSTDA Supercomputer Center) for GPU compute on LANTA cluster
- Special thanks to P'Tik, P'Joe, P'Sam, P'nut and P'Earth
- And The Scamper SS5 House
Built with peft==0.15.2 and transformers==4.x
Model tree for PogusTheWhisper/Pathumma-whisper-th-large-v3-natural-noise-finetuned
- Base model: openai/whisper-large-v3
- Finetuned from: nectec/Pathumma-whisper-th-large-v3