khanusa/nd_asr_wav2vec2

This is a fine-tuned wav2vec2 model for Vietnamese Automatic Speech Recognition (ASR), based on nguyenvulebinh/wav2vec2-base-vi.

Model Description

  • Language: Vietnamese
  • Task: Automatic Speech Recognition
  • Base Model: nguyenvulebinh/wav2vec2-base-vi
  • Architecture: Wav2Vec2 + CTC Head
  • Training Framework: PyTorch
  • Fine-tuning: Custom Vietnamese speech dataset
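
These details can be confirmed directly from the uploaded checkpoint. A minimal sketch; the printed values come from whatever configuration ships with the model:

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")
processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")

# Architecture family and size of the CTC output vocabulary
print(model.config.model_type)   # "wav2vec2"
print(model.config.vocab_size)

# Sampling rate the feature extractor expects (16 kHz for wav2vec2)
print(processor.feature_extractor.sampling_rate)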

Usage

import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")

# Load and preprocess audio
audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)

# Extract input features and run inference
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode predictions
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
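
The same pipeline handles several clips at once; the processor pads them to a common length before batching. A minimal sketch that reuses the processor and model loaded above, where file_paths is a placeholder list of audio files:

# Batch transcription: file_paths is a placeholder list of audio files
file_paths = ["clip1.wav", "clip2.wav"]
batch = [librosa.load(p, sr=16000)[0] for p in file_paths]

inputs = processor(batch, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
for path, text in zip(file_paths, processor.batch_decode(predicted_ids)):
    print(path, "->", text)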

Training Details

Training Data

A 17.36-hour custom Vietnamese speech dataset.

Training Procedure

  • Optimizer: AdamW
  • Learning Rate: 5e-6
  • Batch Size: 8 (gradient accumulation steps: 4; effective batch size 32)
  • Epochs: 50
  • Audio Duration: 7-11 second clips
  • Sampling Rate: 16 kHz
  • Features: 16-bit PCM audio
  • Label Smoothing: 0.1
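
A sketch of preprocessing that matches these specifications: clips are resampled to 16 kHz and kept only if they fall in the 7-11 second window. The actual training pipeline is not published, so treat this as illustrative:

import librosa

TARGET_SR = 16000              # training sampling rate
MIN_SEC, MAX_SEC = 7.0, 11.0   # clip duration window used for training

def load_training_clip(path):
    # librosa resamples to TARGET_SR on load and returns float32 audio
    audio, _ = librosa.load(path, sr=TARGET_SR)
    duration = len(audio) / TARGET_SR
    if not (MIN_SEC <= duration <= MAX_SEC):
        return None  # skip clips outside the 7-11 s window
    return audio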

Training Configuration

  • Mixed Precision Training (AMP)
  • Gradient Clipping: 1.0
  • Warmup Steps: 2000
  • Early Stopping Patience: 8 epochs
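
Together with the procedure above, these settings map onto a standard PyTorch fine-tuning loop. A minimal sketch, assuming a hypothetical train_loader that yields padded input_values and CTC labels; the authors' actual script is not included, and label smoothing for CTC would require a custom loss, so it is omitted here:

import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import Wav2Vec2ForCTC, get_linear_schedule_with_warmup

model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vi").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
scaler = GradScaler()  # mixed precision (AMP)
accum_steps, num_epochs = 4, 50

# train_loader is a hypothetical DataLoader over the speech dataset
total_steps = num_epochs * len(train_loader) // accum_steps
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=total_steps)

model.train()
for epoch in range(num_epochs):
    for step, batch in enumerate(train_loader):
        with autocast():  # mixed-precision forward pass
            loss = model(batch["input_values"].cuda(),
                         labels=batch["labels"].cuda()).loss / accum_steps
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clipping
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()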

Performance

  • WER: 0.2123
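
The reported WER is a standard word error rate. A minimal sketch of how such a number can be reproduced with the jiwer package (this is not the authors' evaluation script; the transcripts below are placeholders):

import jiwer

references = ["xin chào các bạn"]   # ground-truth transcripts (placeholders)
hypotheses = ["xin chào cac bạn"]   # model transcriptions (placeholders)

# WER = (substitutions + deletions + insertions) / reference word count
print(jiwer.wer(references, hypotheses))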

Limitations and Bias

This model was fine-tuned from a Vietnamese pre-trained base model on a specific Vietnamese speech dataset and may not generalize well to:

  • Vietnamese dialects not represented in the training data
  • Noisy environments not represented in the training data
  • Domain-specific vocabulary outside the training scope
  • Audio quality or recording conditions different from training

Citation

@article{baevski2020wav2vec,
  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
  author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
  journal={Advances in neural information processing systems},
  volume={33},
  pages={12449--12460},
  year={2020}
}

License

This model is released under the Apache 2.0 License.
