khanusa/nd_asr_wav2vec2

This is a fine-tuned wav2vec2 model for Vietnamese Automatic Speech Recognition (ASR), based on nguyenvulebinh/wav2vec2-base-vi.

Model Description

  • Language: Vietnamese
  • Task: Automatic Speech Recognition
  • Base Model: nguyenvulebinh/wav2vec2-base-vi
  • Architecture: Wav2Vec2 + CTC Head
  • Training Framework: PyTorch
  • Fine-tuning: Custom Vietnamese speech dataset
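
These details can be confirmed directly from the uploaded checkpoint. A minimal sketch; the printed values come from whatever configuration ships with the model:

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")
processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")

# Architecture family and size of the CTC output vocabulary
print(model.config.model_type)   # "wav2vec2"
print(model.config.vocab_size)

# Sampling rate the feature extractor expects (16 kHz for wav2vec2)
print(processor.feature_extractor.sampling_rate)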

Usage

import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")

# Load and preprocess audio
audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)

# Extract input features and run inference
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode predictions
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
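
The same pipeline handles several clips at once; the processor pads them to a common length before batching. A minimal sketch that reuses the processor and model loaded above, where file_paths is a placeholder list of audio files:

# Batch transcription: file_paths is a placeholder list of audio files
file_paths = ["clip1.wav", "clip2.wav"]
batch = [librosa.load(p, sr=16000)[0] for p in file_paths]

inputs = processor(batch, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
for path, text in zip(file_paths, processor.batch_decode(predicted_ids)):
    print(path, "->", text)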

Training Details

Training Data

A 17.36-hour custom Vietnamese speech dataset.

Training Procedure

  • Optimizer: AdamW
  • Learning Rate: 5e-6
  • Batch Size: 8 (gradient accumulation steps: 4; effective batch size 32)
  • Epochs: 50
  • Audio Duration: 7-11 second clips
  • Sampling Rate: 16 kHz
  • Features: 16-bit PCM audio
  • Label Smoothing: 0.1
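
A sketch of preprocessing that matches these specifications: clips are resampled to 16 kHz and kept only if they fall in the 7-11 second window. The actual training pipeline is not published, so treat this as illustrative:

import librosa

TARGET_SR = 16000              # training sampling rate
MIN_SEC, MAX_SEC = 7.0, 11.0   # clip duration window used for training

def load_training_clip(path):
    # librosa resamples to TARGET_SR on load and returns float32 audio
    audio, _ = librosa.load(path, sr=TARGET_SR)
    duration = len(audio) / TARGET_SR
    if not (MIN_SEC <= duration <= MAX_SEC):
        return None  # skip clips outside the 7-11 s window
    return audio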

Training Configuration

  • Mixed Precision Training (AMP)
  • Gradient Clipping: 1.0
  • Warmup Steps: 2000
  • Early Stopping Patience: 8 epochs
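
Together with the procedure above, these settings map onto a standard PyTorch fine-tuning loop. A minimal sketch, assuming a hypothetical train_loader that yields padded input_values and CTC labels; the authors' actual script is not included, and label smoothing for CTC would require a custom loss, so it is omitted here:

import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import Wav2Vec2ForCTC, get_linear_schedule_with_warmup

model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vi").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
scaler = GradScaler()  # mixed precision (AMP)
accum_steps, num_epochs = 4, 50

# train_loader is a hypothetical DataLoader over the speech dataset
total_steps = num_epochs * len(train_loader) // accum_steps
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=total_steps)

model.train()
for epoch in range(num_epochs):
    for step, batch in enumerate(train_loader):
        with autocast():  # mixed-precision forward pass
            loss = model(batch["input_values"].cuda(),
                         labels=batch["labels"].cuda()).loss / accum_steps
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clipping
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()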

Performance

  • WER: 0.2123
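
The reported WER is a standard word error rate. A minimal sketch of how such a number can be reproduced with the jiwer package (this is not the authors' evaluation script; the transcripts below are placeholders):

import jiwer

references = ["xin chào các bạn"]   # ground-truth transcripts (placeholders)
hypotheses = ["xin chào cac bạn"]   # model transcriptions (placeholders)

# WER = (substitutions + deletions + insertions) / reference word count
print(jiwer.wer(references, hypotheses))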

Limitations and Bias

This model was fine-tuned from a Vietnamese pre-trained base model on a specific Vietnamese speech dataset and may not generalize well to:

  • Vietnamese dialects not represented in the training data
  • Noisy environments not represented in the training data
  • Domain-specific vocabulary outside the training scope
  • Audio quality or recording conditions different from training

Citation

@article{baevski2020wav2vec,
  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
  author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
  journal={Advances in neural information processing systems},
  volume={33},
  pages={12449--12460},
  year={2020}
}

License

This model is released under the Apache 2.0 License.
