---
language: vi
license: apache-2.0
base_model: nguyenvulebinh/wav2vec2-base-vi
tags:
- wav2vec2
- automatic-speech-recognition
- speech
- audio
- vietnamese
- pytorch
- CTC
datasets:
- custom-vietnamese-speech
metrics:
- wer
model-index:
- name: khanusa/nd_asr_wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Custom Vietnamese Speech Dataset
      type: custom
    metrics:
    - name: WER
      type: wer
      value: "TBD" # Update with your actual WER score
---

# khanusa/nd_asr_wav2vec2

This is a wav2vec2 model fine-tuned for Vietnamese Automatic Speech Recognition (ASR), based on `nguyenvulebinh/wav2vec2-base-vi`.

## Model Description

- **Language:** Vietnamese
- **Task:** Automatic Speech Recognition
- **Base Model:** nguyenvulebinh/wav2vec2-base-vi
- **Architecture:** Wav2Vec2 + CTC Head
- **Training Framework:** PyTorch
- **Fine-tuning:** Custom Vietnamese speech dataset

## Usage

```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")

# Load the audio as mono and resample it to the 16 kHz rate the model expects
audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)

# Convert the waveform to model inputs and run inference
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token per frame, then collapse
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```

## Training Details

### Training Data

A custom Vietnamese speech dataset totaling 17.36 hours.

### Training Procedure

- **Optimizer:** AdamW
- **Learning Rate:** 5e-6
- **Batch Size:** 8 (with 4 gradient accumulation steps)
- **Epochs:** 50
- **Audio Duration:** 7-11 second clips
- **Sampling Rate:** 16 kHz
- **Features:** 16-bit PCM audio
- **Label Smoothing:** 0.1

### Training Configuration

- Mixed Precision Training (AMP)
- Gradient Clipping: 1.0
- Warmup Steps: 2000
- Early Stopping Patience: 8 epochs

## Performance

| Metric | Value  |
|--------|--------|
| WER    | 0.2123 |

*Note: Please update the WER value with your actual evaluation results.*

## Limitations and Bias

This model was fine-tuned from `nguyenvulebinh/wav2vec2-base-vi`, a base model pre-trained on Vietnamese speech, using a specific Vietnamese speech dataset. It may not generalize well to:

- Different Vietnamese dialects
- Noisy environments not represented in the training data
- Domain-specific vocabulary outside the training scope
- Audio quality different from the training conditions

## Citation

```bibtex
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
  author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
  journal={Advances in neural information processing systems},
  volume={33},
  pages={12449--12460},
  year={2020}
}
```

## License

This model is released under the Apache 2.0 License.
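
## Evaluation Example

The Performance section reports WER (word error rate) without describing how it was computed. As a rough illustration only, the sketch below implements the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the sample sentences are hypothetical and are not taken from this model's evaluation code; the text normalization actually used before scoring is not documented here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of four reference words.
print(word_error_rate("xin chào các bạn", "xin chao các bạn"))  # 0.25
```

Under this definition, a WER of 0.2123 means roughly 21 word errors per 100 reference words. Corpus-level scores are usually computed by summing errors over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.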