khanusa/nd_asr_wav2vec2
This is a fine-tuned wav2vec2 model for Vietnamese Automatic Speech Recognition (ASR), based on nguyenvulebinh/wav2vec2-base-vi.
Model Description
- Language: Vietnamese
- Task: Automatic Speech Recognition
- Base Model: nguyenvulebinh/wav2vec2-base-vi
- Architecture: Wav2Vec2 + CTC Head
- Training Framework: PyTorch
- Fine-tuning: Custom Vietnamese speech dataset
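With a CTC head, the model emits one label per audio frame, and the transcription is obtained by merging consecutive repeats and then dropping the blank symbol. The following is a minimal sketch of that greedy collapse on plain integer ids. The function name `ctc_greedy_collapse` and the default `blank_id=0` are illustrative assumptions; in practice the processor's `batch_decode` performs this step, and the blank id is whatever the model's tokenizer defines.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a label sequence (greedy CTC).

    blank_id=0 is an assumption for this sketch; check the tokenizer
    of the actual model for the real blank/pad id.
    """
    # groupby merges runs of identical consecutive ids into one key,
    # then the blank id is filtered out.
    return [label for label, _ in groupby(frame_ids) if label != blank_id]


# Frames: blank, 5, 5, blank, 5, 7, 7, blank, blank
print(ctc_greedy_collapse([0, 5, 5, 0, 5, 7, 7, 0, 0]))  # -> [5, 5, 7]
```

Note that the blank between the two runs of `5` is what allows the same label to appear twice in a row in the output.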
Usage
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")
# Load and preprocess audio
audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)
# Tokenize and predict
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits
# Decode predictions
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
Training Details
Training Data
17.36 hours of custom Vietnamese speech data
Training Procedure
- Optimizer: AdamW
- Learning Rate: 5e-6
- Batch Size: 8 (with gradient accumulation steps: 4)
- Epochs: 50
- Audio Duration: 7-11 second clips
- Sampling Rate: 16 kHz
- Features: 16-bit PCM audio
- Label Smoothing: 0.1
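The batch size and gradient accumulation settings above combine into an effective batch size, and together with the dataset size they give a rough idea of how many optimizer updates one epoch contains. The sketch below works through that arithmetic. It assumes that 8 is the per-step batch size, that accumulation multiplies it, and that all 17.36 hours are used for training; the helper names are hypothetical, and the clip counts are estimates derived from the stated 7-11 second range, not reported figures.

```python
import math

TOTAL_HOURS = 17.36          # dataset size from the card
CLIP_SECONDS_RANGE = (7, 11)  # clip duration range from the card
BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4


def effective_batch_size(batch_size, accum_steps):
    """Samples contributing to one optimizer update."""
    return batch_size * accum_steps


def estimate_clip_count(total_hours, clip_seconds):
    """Rough clip count if every clip had the given duration."""
    return total_hours * 3600 / clip_seconds


def updates_per_epoch(num_clips, batch_size, accum_steps):
    """Optimizer updates needed to see num_clips once."""
    return math.ceil(num_clips / effective_batch_size(batch_size, accum_steps))


print("effective batch size:", effective_batch_size(BATCH_SIZE, GRAD_ACCUM_STEPS))
for seconds in CLIP_SECONDS_RANGE:
    clips = estimate_clip_count(TOTAL_HOURS, seconds)
    updates = updates_per_epoch(clips, BATCH_SIZE, GRAD_ACCUM_STEPS)
    print(f"{seconds} s clips: ~{clips:.0f} clips, ~{updates} updates per epoch")
```

Under these assumptions, an epoch is on the order of a few hundred optimizer updates.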
Training Configuration
- Mixed Precision Training (AMP)
- Gradient Clipping: 1.0
- Warmup Steps: 2000
- Early Stopping Patience: 8 epochs
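The warmup and early stopping settings can be sketched in plain Python as follows. This is an illustration, not the training code used for this model: the card does not state the warmup shape or the monitored metric, so the sketch assumes a linear warmup to the 5e-6 learning rate followed by a constant rate, and a validation metric where lower is better (for example loss or WER). `warmup_lr` and `EarlyStopping` are hypothetical names.

```python
WARMUP_STEPS = 2000  # from the card
BASE_LR = 5e-6       # from the card
PATIENCE = 8         # epochs, from the card


def warmup_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Assumed linear warmup from 0 to base_lr, then constant."""
    return base_lr * min(1.0, step / warmup_steps)


class EarlyStopping:
    """Stop when the metric has not improved for `patience` epochs.

    Assumes a lower metric value is better.
    """

    def __init__(self, patience=PATIENCE):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True if training should stop."""
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
for epoch, val_metric in enumerate([0.40, 0.35, 0.36, 0.37], start=1):
    if stopper.step(val_metric):
        print(f"stopping after epoch {epoch}, best metric {stopper.best}")
        break
```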
Performance
| Metric | Value |
|---|---|
| WER | 0.2123 |
Note: Please update the WER value with your actual evaluation results.
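For reference, word error rate (WER) is the word-level edit distance between the reference and the hypothesis (substitutions, deletions, and insertions) divided by the number of reference words, so a value of 0.2123 corresponds to 21.23%. The sketch below computes it for a single utterance with whitespace tokenization; `word_error_rate` is a hypothetical helper, and the card does not state which tool, text normalization, or aggregation was used for the reported value.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("xin chào các bạn", "xin chào bạn"))  # 0.25
```

When evaluating a whole test set, WER is usually computed as total errors over total reference words rather than as the mean of per-utterance values.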
Limitations and Bias
This model was fine-tuned from nguyenvulebinh/wav2vec2-base-vi on a specific Vietnamese speech dataset and may not generalize well to:
- Different Vietnamese dialects
- Noisy environments not represented in training data
- Domain-specific vocabulary outside of training scope
- Audio quality different from training conditions
Citation
@article{baevski2020wav2vec,
title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
journal={Advances in neural information processing systems},
volume={33},
pages={12449--12460},
year={2020}
}
License
This model is released under the Apache 2.0 License.