Pathumma Whisper Large V3 (TH) - Natural Noise-Robust Finetuned (v4, LoRA)

Model Description

This model is a Thai Automatic Speech Recognition (ASR) system based on nectec/Pathumma-whisper-th-large-v3, enhanced with LoRA (Low-Rank Adaptation) fine-tuning to improve robustness in noisy environments.

It uses WhisperForConditionalGeneration, with SpecAugment applied during training for noise robustness and gradient checkpointing enabled to reduce memory use. Training was done on a custom dataset simulating voice messages, ambient sound, and conversational noise, targeting real-world noisy and spontaneous Thai speech.


Dataset

  • Name: tingwry/asr-augmented
  • Description: Thai ASR dataset augmented with realistic background noise (e.g., voice messages, ambient environments) to simulate common recording conditions.
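
A minimal sketch of loading the dataset from the Hub for a first look; the splits and column names are whatever the dataset defines, so verify them via the printed structure before wiring up training:

from datasets import load_dataset

# Load the augmented Thai ASR dataset from the Hugging Face Hub.
ds = load_dataset("tingwry/asr-augmented")

# Inspect splits and column names before using specific fields.
print(ds)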

Quickstart

import torch
from transformers import pipeline

# Prefer GPU with bf16; fall back to CPU with fp32.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

lang = "th"
task = "transcribe"

pipe = pipeline(
    task="automatic-speech-recognition",
    model="PogusTheWhisper/Pathumma-whisper-th-large-v3-natural-noise-finetuned",
    device=device,
    torch_dtype=torch_dtype,
    chunk_length_s=30,  # split long audio into 30-second chunks
    return_timestamps=False,
)

# Force Thai transcription so the model neither auto-detects the language nor translates.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task=task)

audio_path = "<Your wav file>"  # path to your audio file
result = pipe(audio_path)

print("Full Transcription:\n", result["text"])

Model Architecture

  • Base Model: nectec/Pathumma-whisper-th-large-v3
  • Adapter Type: LoRA
  • Target Modules: q_proj, k_proj, v_proj
  • LoRA Config (reproduced in the sketch below):
    • r=8
    • lora_alpha=32
    • lora_dropout=0.1
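
A minimal peft sketch that reproduces the adapter configuration listed above; attaching it to a freshly loaded base model mirrors, but is not copied from, the original training script:

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("nectec/Pathumma-whisper-th-large-v3")

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj"],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable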

SpecAugment

  • mask_time_prob = 0.2
  • mask_feature_prob = 0.2
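
These map directly onto the SpecAugment fields of WhisperConfig. A short sketch of enabling them before training (the apply_spec_augment flag is implied by the settings above but not listed):

# Enable SpecAugment on the model config; it is applied only during training.
model.config.apply_spec_augment = True
model.config.mask_time_prob = 0.2     # probability of masking spans along the time axis
model.config.mask_feature_prob = 0.2  # probability of masking spans along the feature axis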

Training Arguments

  • Epochs: 8
  • Learning Rate: 2e-5
  • Scheduler: Cosine
  • Warmup Ratio: 0.05
  • Batch Size: 4 (per device)
  • Precision: bf16
  • Optimizer: AdamW (fused)
  • Gradient Checkpointing: Enabled
  • Metric: CER
  • Generation Max Length: 256
  • Generation Beams: 5
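
A hedged Seq2SeqTrainingArguments sketch matching the settings above; the output directory and the eval/save cadence are assumptions, not published values:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./pathumma-whisper-noise-lora",  # assumed path
    num_train_epochs=8,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    per_device_train_batch_size=4,
    bf16=True,
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=256,
    generation_num_beams=5,
    metric_for_best_model="cer",
    greater_is_better=False,  # lower CER is better
)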

Training Results

| Epoch | Training Loss | Validation Loss | CER      | WER      |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.049300      | 0.022428        | 0.052511 | 0.124607 |
| 2     | 0.017500      | 0.015223        | 0.051452 | 0.100236 |
| 3     | 0.012900      | 0.012217        | 0.049419 | 0.092767 |
| 4     | 0.009900      | 0.010561        | 0.049024 | 0.091588 |
| 5     | 0.007500      | 0.010173        | 0.048868 | 0.087657 |
| 6     | 0.007200      | 0.009647        | 0.050930 | 0.086478 |
| 7     | 0.006700      | 0.009532        | 0.051565 | 0.087264 |
| 8     | 0.006400      | 0.009492        | 0.047598 | 0.086478 |

Evaluation Performance (%)

CER

| Model | Samples | SEACrowd/gowajee | SEACrowd/thai_elderly_speech | fsicoli/common_voice_18_0 | google/fleurs | tingwry/asr-augmented |
|---|---|---|---|---|---|---|
| whisper-large-v3 | 388 | 37.82 | 5.24 | 9.25 | 10.95 | 4.58 |
| pathumma-whisper-th-large-v3-natural-noise-finetuned | 388 | 2.18 | 0.84 | 4.73 | 7.21 | 1.30 |
| airesearch-wav2vec2-large-xlsr-53-th | 388 | 30.31 | 3.83 | 6.49 | 12.84 | 8.19 |
| pathumma-whisper-th-large-v3 | 388 | 1.27 | 0.50 | 4.75 | 7.39 | 4.57 |
| monsoon-whisper-medium-gigaspeech2 | 388 | 30.31 | 3.83 | 6.49 | 12.84 | 8.19 |
| thonburian-whisper-th-large-v3-combined | 388 | 8.61 | 0.81 | 5.80 | 7.45 | 2.71 |

WER

| Model | Samples | SEACrowd/gowajee | SEACrowd/thai_elderly_speech | fsicoli/common_voice_18_0 | google/fleurs | tingwry/asr-augmented |
|---|---|---|---|---|---|---|
| whisper-large-v3 | 388 | 94.10 | 96.91 | 78.84 | 87.97 | 74.12 |
| pathumma-whisper-th-large-v3-natural-noise-finetuned | 388 | 8.23 | 19.33 | 69.10 | 69.39 | 7.15 |
| airesearch-wav2vec2-large-xlsr-53-th | 388 | 99.58 | 38.92 | 67.79 | 99.63 | 100.00 |
| pathumma-whisper-th-large-v3 | 388 | 4.37 | 5.41 | 80.34 | 71.13 | 90.02 |
| monsoon-whisper-medium-gigaspeech2 | 388 | 99.58 | 38.92 | 67.79 | 99.63 | 100.00 |
| thonburian-whisper-th-large-v3-combined | 388 | 39.84 | 11.08 | 110.67 | 66.33 | 49.85 |
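
The exact evaluation script is not published; for reference, CER and WER as reported in the tables above can be reproduced with the evaluate library (both metrics return fractions, so multiply by 100 to match the percentages):

import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

predictions = ["example transcription"]  # model outputs (dummy placeholder)
references = ["example transcript"]      # ground-truth text (dummy placeholder)

print("CER (%):", 100 * cer_metric.compute(predictions=predictions, references=references))
print("WER (%):", 100 * wer_metric.compute(predictions=predictions, references=references))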

Limitations and Future Work

  • Trained on Thai-only speech, not multilingual
  • Evaluated using CER and WER; Thai word-segmentation-aware metrics will be explored in future versions
  • May not generalize well to regional dialects or highly degraded audio
  • Future improvements may include domain adaptation (e.g., medical, legal) and dialect-specific tuning

Acknowledgements

  • NECTEC for the original base model
  • OpenAI for Whisper architecture
  • SuperAI Engineer Program for mentor support
  • ThaiSC (NSTDA Supercomputer Center) for GPU compute on LANTA cluster
  • Special thanks to P'Tik, P'Joe, P'Sam, P'nut and P'Earth
  • And The Scamper SS5 House

Built with peft==0.15.2 and transformers==4.x
