---
language: vi
license: apache-2.0
base_model: nguyenvulebinh/wav2vec2-base-vi
tags:
- wav2vec2
- automatic-speech-recognition
- speech
- audio
- vietnamese
- pytorch
- CTC
datasets:
- custom-vietnamese-speech
metrics:
- wer
model-index:
- name: khanusa/nd_asr_wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Custom Vietnamese Speech Dataset
      type: custom
    metrics:
    - name: WER
      type: wer
      value: 0.2123
---

# khanusa/nd_asr_wav2vec2

This is a fine-tuned wav2vec2 model for Vietnamese Automatic Speech Recognition (ASR), based on `nguyenvulebinh/wav2vec2-base-vi`.

## Model Description

- **Language:** Vietnamese
- **Task:** Automatic Speech Recognition
- **Base Model:** nguyenvulebinh/wav2vec2-base-vi
- **Architecture:** Wav2Vec2 + CTC Head
- **Training Framework:** PyTorch
- **Fine-tuning:** Custom Vietnamese speech dataset

## Usage

```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")

# Load and preprocess audio
audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)

# Convert the waveform to model inputs and run inference
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode predictions
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
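
The decode step above uses greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank token. The following is a minimal pure-Python sketch of that idea. The toy vocabulary, the `ctc_greedy_decode` helper, and the choice of `0` as the blank id are illustrative assumptions, not this model's actual tokenizer configuration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join tokens into text."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the CTC blank token.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # Step 3: turn the word-delimiter token into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (assumption, not the real model vocabulary).
toy_vocab = {0: "<pad>", 1: "|", 2: "x", 3: "i", 4: "n"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would produce for one utterance.
frame_ids = [2, 2, 0, 3, 3, 3, 4, 0, 1, 1, 2, 0, 3, 4, 4]
decoded = ctc_greedy_decode(frame_ids, toy_vocab)
print(decoded)
```

A blank between two identical ids keeps them as separate characters, which is how CTC represents doubled letters.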

## Training Details

### Training Data
A custom Vietnamese speech dataset totaling 17.36 hours of audio.

### Training Procedure
- **Optimizer:** AdamW
- **Learning Rate:** 5e-6
- **Batch Size:** 8 per step with 4 gradient accumulation steps (effective batch size 32)
- **Epochs:** 50
- **Audio Duration:** 7-11 second clips
- **Sampling Rate:** 16 kHz
- **Features:** 16-bit PCM audio
- **Label Smoothing:** 0.1
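
To check whether an input file resembles the audio format listed above (16 kHz, 16-bit PCM, 7-11 seconds), a small standard-library sketch might look like the following. The helper names `describe_wav` and `matches_training_format` are hypothetical, and treating the 7-11 second range as a check is an assumption: the model may still work outside it.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        bits = wav_file.getsampwidth() * 8
        duration = wav_file.getnframes() / sample_rate
    return sample_rate, bits, duration


def matches_training_format(path, min_s=7.0, max_s=11.0):
    """Heuristic check against the format described in this card (assumed bounds)."""
    sample_rate, bits, duration = describe_wav(path)
    return sample_rate == 16000 and bits == 16 and min_s <= duration <= max_s


# Demo: synthesize an 8-second, 16 kHz, 16-bit mono sine tone and inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav_file.setframerate(16000)
    samples = (
        int(12000 * math.sin(2 * math.pi * 440 * n / 16000))
        for n in range(16000 * 8)
    )
    wav_file.writeframes(b"".join(struct.pack("<h", s) for s in samples))

print(describe_wav(demo_path))
print(matches_training_format(demo_path))
```

Files at other sampling rates can still be used with the snippet in the Usage section, because `librosa.load(..., sr=16000)` resamples on load.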

### Training Configuration
- Mixed Precision Training (AMP)
- Gradient Clipping: 1.0
- Warmup Steps: 2000
- Early Stopping Patience: 8 epochs
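
As a rough sanity check on how these settings relate, the sketch below estimates the number of optimizer steps from the figures in this card. It assumes a mean clip length of 9 seconds (the midpoint of the 7-11 second range), that all 17.36 hours are used for training, and that early stopping does not trigger. None of these assumptions are confirmed by the card.

```python
# Values taken from this model card.
TOTAL_HOURS = 17.36
PER_STEP_BATCH = 8
GRAD_ACCUM_STEPS = 4
EPOCHS = 50
WARMUP_STEPS = 2000

# Assumption: clips average 9 s, the midpoint of the stated 7-11 s range.
ASSUMED_MEAN_CLIP_SECONDS = 9.0

# Samples consumed per optimizer update.
effective_batch = PER_STEP_BATCH * GRAD_ACCUM_STEPS

# Estimated number of training clips under the assumed mean duration.
estimated_clips = round(TOTAL_HOURS * 3600 / ASSUMED_MEAN_CLIP_SECONDS)

# Ceiling division: a partial final batch still counts as one optimizer step.
steps_per_epoch = -(-estimated_clips // effective_batch)
total_steps = steps_per_epoch * EPOCHS

# Share of the full schedule spent in learning-rate warmup.
warmup_fraction = WARMUP_STEPS / total_steps

print(f"effective batch size: {effective_batch}")
print(f"estimated clips: {estimated_clips}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup fraction: {warmup_fraction:.2f}")
```

Under these assumptions, the estimate is about 217 optimizer steps per epoch and 10,850 in total, so the 2000 warmup steps would cover roughly 18% of a full 50-epoch run.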

## Performance

| Metric | Value |
|--------|-------|
| WER    | 0.2123   |
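
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The following is a minimal pure-Python sketch of the metric. The `word_error_rate` helper and the example sentences are illustrative, and this is not necessarily the evaluation script used to produce the value above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One missing word out of four reference words.
example_wer = word_error_rate("xin chào các bạn", "xin chào bạn")
print(example_wer)
```

Scores computed this way are fractions, so they can exceed 1.0 when the hypothesis contains many inserted words.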


## Limitations and Bias

This model was fine-tuned from the Vietnamese base model `nguyenvulebinh/wav2vec2-base-vi` on a single custom Vietnamese speech dataset and may not generalize well to:
- Vietnamese dialects not represented in the training data
- Noisy environments not represented in the training data
- Domain-specific vocabulary outside the training scope
- Audio quality that differs from the training conditions

## Citation

```bibtex
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
  author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
  journal={Advances in neural information processing systems},
  volume={33},
  pages={12449--12460},
  year={2020}
}
```

## License

This model is released under the Apache 2.0 License.