Whisper Small Armenian v2: Enhanced Fine-tuning for Armenian Speech Recognition
This model is an enhanced fine-tuned version of Chillarmo/whisper-small-armenian on the Chillarmo/common_voice_20_armenian dataset. This v2 model incorporates additional training data and optimizations to achieve improved performance for Armenian automatic speech recognition tasks.
Model Details
Model Description
This is an enhanced fine-tuned Whisper model specifically optimized for Armenian speech recognition. The model builds upon a previously fine-tuned Whisper small model for Armenian and has been further trained with additional data to improve transcription accuracy and robustness for the Armenian language.
- Developed by: Movses Movsesyan (Independent Research)
- Model type: Automatic Speech Recognition
- Language(s): Armenian (hy)
- License: Apache 2.0
- Finetuned from model: Chillarmo/whisper-small-armenian
Model Sources
- Repository: Hugging Face Model Hub
- Base Model: OpenAI Whisper
- Paper: Robust Speech Recognition via Large-Scale Weak Supervision
Uses
Direct Use
This model can be directly used for transcribing Armenian speech to text. It's particularly well-suited for:
- Converting Armenian audio recordings to text
- Real-time Armenian speech transcription
- Building Armenian voice interfaces and applications
- Research in Armenian computational linguistics
Downstream Use
The model can be integrated into larger applications such as:
- Voice assistants for Armenian speakers
- Subtitle generation for Armenian media content (see the sketch after this list)
- Accessibility tools for Armenian-speaking communities
- Educational applications for Armenian language learning
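As one illustration of the subtitle-generation use case above, the sketch below requests chunk-level timestamps from the transformers pipeline and formats them as SRT. The `to_srt` helper is hypothetical, written for this example only; it is not part of this repository.

```python
from transformers import pipeline

# Ask the pipeline for (start, end) timestamps on each transcribed chunk
asr = pipeline(
    "automatic-speech-recognition",
    model="Chillarmo/whisper-small-armenian-v2",
    return_timestamps=True,
)

def to_srt(chunks):
    """Format pipeline output chunks as an SRT subtitle string (hypothetical helper)."""
    def fmt(t):
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the final chunk can lack an end timestamp
            end = start
        blocks.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{chunk['text'].strip()}\n")
    return "\n".join(blocks)

# result = asr("path/to/armenian_audio.wav")
# print(to_srt(result["chunks"]))
```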
Out-of-Scope Use
This model should not be used for:
- Speech recognition in languages other than Armenian
- Speaker identification or verification
- Audio classification beyond speech transcription
- Medical or legal transcription requiring 100% accuracy
Bias, Risks, and Limitations
The model may have limitations including:
- Domain bias: Performance may vary significantly across different speaking styles, accents, and audio quality
- Vocabulary limitations: May struggle with technical terms, proper nouns, or words not present in the training data
- Audio quality dependency: Performance degrades with poor audio quality, background noise, or multiple speakers
- Dialectal variations: May show bias toward specific Armenian dialects represented in the training data
Recommendations
Users should be aware of these limitations and:
- Test the model thoroughly on their specific use case and domain
- Implement appropriate error handling for critical applications (see the confidence-gating sketch after this list)
- Consider human review for high-stakes transcription tasks
- Be mindful of potential biases when deploying in diverse linguistic contexts
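One way to act on the error-handling and human-review recommendations above is to gate transcriptions on model confidence. The sketch below scores each output by its average token log-probability; the -1.0 threshold is an illustrative assumption and should be calibrated on held-out data for your own domain.

```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("Chillarmo/whisper-small-armenian-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Chillarmo/whisper-small-armenian-v2")

def transcribe_with_confidence(audio, threshold=-1.0):
    """Return (text, needs_review); the threshold is an uncalibrated assumption."""
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            inputs["input_features"],
            output_scores=True,
            return_dict_in_generate=True,
        )
    # Average log-probability of the generated tokens as a rough confidence score
    scores = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    text = processor.batch_decode(out.sequences, skip_special_tokens=True)[0]
    return text, scores.mean().item() < threshold
```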
How to Get Started with the Model
Use the code below to get started with the model:
```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load the processor and model
processor = AutoProcessor.from_pretrained("Chillarmo/whisper-small-armenian-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Chillarmo/whisper-small-armenian-v2")

def transcribe_armenian(audio_path):
    # Load the audio file and resample to the 16 kHz rate Whisper expects
    audio, sr = librosa.load(audio_path, sr=16000)

    # Convert the waveform into log-mel input features
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

    # Generate the token IDs of the transcription
    with torch.no_grad():
        predicted_ids = model.generate(inputs["input_features"])

    # Decode the token IDs into text
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]

# Example usage
# transcription = transcribe_armenian("path/to/armenian_audio.wav")
# print(transcription)
```
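For quick experiments, the same checkpoint can also be loaded through the high-level pipeline API, which handles audio loading, resampling, and decoding internally:

```python
from transformers import pipeline

# The pipeline loads the audio, resamples it, and decodes the output for you
asr = pipeline("automatic-speech-recognition", model="Chillarmo/whisper-small-armenian-v2")
# print(asr("path/to/armenian_audio.wav")["text"])
```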
Training Details
Training Data
The model was fine-tuned on the Chillarmo/common_voice_20_armenian dataset with additional training data incorporated to enhance performance and robustness. This v2 version represents an iterative improvement over the base fine-tuned model, with expanded training data to better capture Armenian speech patterns and vocabulary.
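The dataset can be inspected directly from the Hub with the datasets library; a minimal sketch (the exact split and column names should be confirmed by inspecting the loaded object):

```python
from datasets import load_dataset

# Load the Armenian Common Voice dataset used for fine-tuning
dataset = load_dataset("Chillarmo/common_voice_20_armenian")
print(dataset)  # prints the available splits and column names
# Each example is expected to pair an "audio" field with its transcription,
# but verify the column names against the printout above.
```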
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training:
- Training regime: Mixed precision training
- Epochs: 5.24
- Training runtime: 44,426 seconds (approximately 12.3 hours)
- Training samples per second: 1.801
- Training steps per second: 0.113
- Final training loss: 0.076
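For readers who want to set up a comparable run, a minimal sketch of the training configuration follows. Only mixed precision and the 5,000-step budget come from the numbers above; the batch size, learning rate, and warmup are illustrative assumptions, not the actual recipe.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-armenian-v2",
    max_steps=5000,                   # reported: 5,000 training steps
    fp16=True,                        # reported: mixed precision training
    per_device_train_batch_size=16,   # assumption, not reported
    learning_rate=1e-5,               # assumption, not reported
    warmup_steps=500,                 # assumption, not reported
    predict_with_generate=True,       # needed for WER/CER evaluation during training
)
```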
Speeds, Sizes, Times
- Training time: ~12.3 hours for 5000 training steps
- Evaluation time: ~2.6 hours for evaluation
- Evaluation samples per second: 0.624
- Total training steps: 5,000
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on a held-out test set from the Chillarmo/common_voice_20_armenian dataset.
Metrics
The model was evaluated using standard speech recognition metrics:
- Word Error Rate (WER): Measures the percentage of words that are incorrectly transcribed
- Character Error Rate (CER): Measures the percentage of characters that are incorrectly transcribed
- Exact Match: Percentage of utterances that are transcribed perfectly
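All three metrics can be reproduced with the evaluate and jiwer libraries listed under Software below; a minimal sketch with placeholder data:

```python
import evaluate

wer_metric = evaluate.load("wer")  # backed by jiwer
cer_metric = evaluate.load("cer")

predictions = ["placeholder model output"]   # replace with real model outputs
references = ["placeholder reference text"]  # replace with ground-truth transcripts

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
exact = 100 * sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"WER: {wer:.2f}%  CER: {cer:.2f}%  Exact Match: {exact:.2f}%")
```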
Results
The fine-tuned model achieved the following performance on the evaluation set:
| Metric | Value |
|---|---|
| Word Error Rate (WER) | 24.01% |
| Character Error Rate (CER) | 4.77% |
| Exact Match | 28.14% |
| Average Prediction Length | 7.74 tokens |
| Average Label Length | 7.77 tokens |
| Length Ratio | 0.995 |
Summary
The model demonstrates strong performance for Armenian speech recognition with a relatively low character error rate of 4.77% and word error rate of 24.01%. The length ratio close to 1.0 indicates that the model generates transcriptions of appropriate length compared to the ground truth.
Technical Specifications
Model Architecture and Objective
This model is based on the Whisper architecture, which uses a Transformer encoder-decoder structure:
- Encoder: Processes mel-spectrogram features from audio input
- Decoder: Generates text tokens autoregressively
- Architecture: Transformer-based sequence-to-sequence model
- Model size: Small (244M parameters)
- Input: 80-dimensional log mel-spectrograms
- Output: Armenian text transcriptions
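To make the input format concrete, the processor converts raw 16 kHz audio into the 80-channel log-mel representation described above, padded to Whisper's fixed 30-second window:

```python
import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Chillarmo/whisper-small-armenian-v2")

# One second of silence at 16 kHz, just to inspect the feature shape
audio = np.zeros(16000, dtype=np.float32)
features = processor(audio, sampling_rate=16000, return_tensors="pt")["input_features"]
print(features.shape)  # torch.Size([1, 80, 3000]): 80 mel bins over a padded 30 s window
```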
Compute Infrastructure
Hardware
Training was performed on the following hardware configuration:
- GPU: 1x NVIDIA GeForce RTX 3060 Ti (8GB VRAM)
- CPU: Intel Core i7-10700F
- RAM: 32GB System Memory
- Operating System: Windows
- Training Environment: Local machine setup
Software
- Framework: Hugging Face Transformers
- Training library: PyTorch with Accelerate
- Audio processing: librosa, soundfile
- Evaluation: datasets, evaluate, jiwer
Citation
BibTeX:
```bibtex
@misc{movsesyan2025whisper-armenian-v2,
  author    = {Movsesyan, Movses},
  title     = {Whisper Small Armenian v2},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Chillarmo/whisper-small-armenian-v2}
}

@inproceedings{radford2022robust,
  title        = {Robust speech recognition via large-scale weak supervision},
  author       = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  booktitle    = {International Conference on Machine Learning},
  pages        = {28492--28518},
  year         = {2023},
  organization = {PMLR}
}
```
APA:
Movsesyan, M. (2025). Whisper Small Armenian v2. Hugging Face. https://huggingface.co/Chillarmo/whisper-small-armenian-v2
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492-28518). PMLR.
Model Card Authors
This model card was created by Movses Movsesyan based on the fine-tuning results and model performance data.