Whisper Small Armenian v2: Enhanced Fine-tuning for Armenian Speech Recognition
This model is an enhanced fine-tuned version of Chillarmo/whisper-small-armenian on the Chillarmo/common_voice_20_armenian dataset. This v2 model incorporates additional training data and optimizations to achieve improved performance for Armenian automatic speech recognition tasks.
Model Details
Model Description
This is an enhanced fine-tuned Whisper model specifically optimized for Armenian speech recognition. The model builds upon a previously fine-tuned Whisper small model for Armenian and has been further trained with additional data to improve transcription accuracy and robustness for the Armenian language.
- Developed by: Movses Movsesyan (Independent Research)
- Model type: Automatic Speech Recognition
- Language(s): Armenian (hy)
- License: Apache 2.0
- Finetuned from model: Chillarmo/whisper-small-armenian
Model Sources
- Repository: Hugging Face Model Hub
- Base Model: OpenAI Whisper
- Paper: Robust Speech Recognition via Large-Scale Weak Supervision
Uses
Direct Use
This model can be directly used for transcribing Armenian speech to text. It's particularly well-suited for:
- Converting Armenian audio recordings to text
- Real-time Armenian speech transcription
- Building Armenian voice interfaces and applications
- Research in Armenian computational linguistics
Downstream Use
The model can be integrated into larger applications such as:
- Voice assistants for Armenian speakers
- Subtitle generation for Armenian media content (see the sketch after this list)
- Accessibility tools for Armenian-speaking communities
- Educational applications for Armenian language learning
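As one illustration of the subtitle-generation use case above, the sketch below requests chunk-level timestamps from the transformers pipeline and formats them as SRT. The `to_srt` helper is hypothetical, written for this example only; it is not part of this repository.

```python
from transformers import pipeline

# Ask the pipeline for (start, end) timestamps on each transcribed chunk
asr = pipeline(
    "automatic-speech-recognition",
    model="Chillarmo/whisper-small-armenian-v2",
    return_timestamps=True,
)

def to_srt(chunks):
    """Format pipeline output chunks as an SRT subtitle string (hypothetical helper)."""
    def fmt(t):
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the final chunk can lack an end timestamp
            end = start
        blocks.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{chunk['text'].strip()}\n")
    return "\n".join(blocks)

# result = asr("path/to/armenian_audio.wav")
# print(to_srt(result["chunks"]))
```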
Out-of-Scope Use
This model should not be used for:
- Speech recognition in languages other than Armenian
- Speaker identification or verification
- Audio classification beyond speech transcription
- Medical or legal transcription requiring 100% accuracy
Bias, Risks, and Limitations
The model may have limitations including:
- Domain bias: Performance may vary significantly across different speaking styles, accents, and audio quality
- Vocabulary limitations: May struggle with technical terms, proper nouns, or words not present in the training data
- Audio quality dependency: Performance degrades with poor audio quality, background noise, or multiple speakers
- Dialectal variations: May show bias toward specific Armenian dialects represented in the training data
Recommendations
Users should be aware of these limitations and:
- Test the model thoroughly on their specific use case and domain
- Implement appropriate error handling for critical applications (see the confidence-gating sketch after this list)
- Consider human review for high-stakes transcription tasks
- Be mindful of potential biases when deploying in diverse linguistic contexts
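One way to act on the error-handling and human-review recommendations above is to gate transcriptions on model confidence. The sketch below scores each output by its average token log-probability; the -1.0 threshold is an illustrative assumption and should be calibrated on held-out data for your own domain.

```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("Chillarmo/whisper-small-armenian-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Chillarmo/whisper-small-armenian-v2")

def transcribe_with_confidence(audio, threshold=-1.0):
    """Return (text, needs_review); the threshold is an uncalibrated assumption."""
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            inputs["input_features"],
            output_scores=True,
            return_dict_in_generate=True,
        )
    # Average log-probability of the generated tokens as a rough confidence score
    scores = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    text = processor.batch_decode(out.sequences, skip_special_tokens=True)[0]
    return text, scores.mean().item() < threshold
```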
How to Get Started with the Model
Use the code below to get started with the model:
```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load the processor and model
processor = AutoProcessor.from_pretrained("Chillarmo/whisper-small-armenian-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Chillarmo/whisper-small-armenian-v2")

def transcribe_armenian(audio_path):
    # Load the audio file and resample to the 16 kHz rate Whisper expects
    audio, sr = librosa.load(audio_path, sr=16000)

    # Convert the waveform into log-mel input features
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

    # Generate the token IDs of the transcription
    with torch.no_grad():
        predicted_ids = model.generate(inputs["input_features"])

    # Decode the token IDs into text
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]

# Example usage
# transcription = transcribe_armenian("path/to/armenian_audio.wav")
# print(transcription)
```
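For quick experiments, the same checkpoint can also be loaded through the high-level pipeline API, which handles audio loading, resampling, and decoding internally:

```python
from transformers import pipeline

# The pipeline loads the audio, resamples it, and decodes the output for you
asr = pipeline("automatic-speech-recognition", model="Chillarmo/whisper-small-armenian-v2")
# print(asr("path/to/armenian_audio.wav")["text"])
```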
Training Details
Training Data
The model was fine-tuned on the Chillarmo/common_voice_20_armenian dataset with additional training data incorporated to enhance performance and robustness. This v2 version represents an iterative improvement over the base fine-tuned model, with expanded training data to better capture Armenian speech patterns and vocabulary.
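The dataset can be inspected directly from the Hub with the datasets library; a minimal sketch (the exact split and column names should be confirmed by inspecting the loaded object):

```python
from datasets import load_dataset

# Load the Armenian Common Voice dataset used for fine-tuning
dataset = load_dataset("Chillarmo/common_voice_20_armenian")
print(dataset)  # prints the available splits and column names
# Each example is expected to pair an "audio" field with its transcription,
# but verify the column names against the printout above.
```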
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training:
- Training regime: Mixed precision training
- Epochs: 5.24
- Training runtime: 44,426 seconds (approximately 12.3 hours)
- Training samples per second: 1.801
- Training steps per second: 0.113
- Final training loss: 0.076
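For readers who want to set up a comparable run, a minimal sketch of the training configuration follows. Only mixed precision and the 5,000-step budget come from the numbers above; the batch size, learning rate, and warmup are illustrative assumptions, not the actual recipe.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-armenian-v2",
    max_steps=5000,                   # reported: 5,000 training steps
    fp16=True,                        # reported: mixed precision training
    per_device_train_batch_size=16,   # assumption, not reported
    learning_rate=1e-5,               # assumption, not reported
    warmup_steps=500,                 # assumption, not reported
    predict_with_generate=True,       # needed for WER/CER evaluation during training
)
```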
Speeds, Sizes, Times
- Training time: ~12.3 hours for 5000 training steps
- Evaluation time: ~2.6 hours for evaluation
- Evaluation samples per second: 0.624
- Total training steps: 5,000
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on a held-out test set from the Chillarmo/common_voice_20_armenian dataset.
Metrics
The model was evaluated using standard speech recognition metrics:
- Word Error Rate (WER): Measures the percentage of words that are incorrectly transcribed
- Character Error Rate (CER): Measures the percentage of characters that are incorrectly transcribed
- Exact Match: Percentage of utterances that are transcribed perfectly
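All three metrics can be reproduced with the evaluate and jiwer libraries listed under Software below; a minimal sketch with placeholder data:

```python
import evaluate

wer_metric = evaluate.load("wer")  # backed by jiwer
cer_metric = evaluate.load("cer")

predictions = ["placeholder model output"]   # replace with real model outputs
references = ["placeholder reference text"]  # replace with ground-truth transcripts

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
exact = 100 * sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"WER: {wer:.2f}%  CER: {cer:.2f}%  Exact Match: {exact:.2f}%")
```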
Results
The fine-tuned model achieved the following performance on the evaluation set:
| Metric | Value |
|---|---|
| Word Error Rate (WER) | 24.01% |
| Character Error Rate (CER) | 4.77% |
| Exact Match | 28.14% |
| Average Prediction Length | 7.74 tokens |
| Average Label Length | 7.77 tokens |
| Length Ratio | 0.995 |
Summary
The model demonstrates strong performance for Armenian speech recognition with a relatively low character error rate of 4.77% and word error rate of 24.01%. The length ratio close to 1.0 indicates that the model generates transcriptions of appropriate length compared to the ground truth.
Technical Specifications
Model Architecture and Objective
This model is based on the Whisper architecture, which uses a Transformer encoder-decoder structure:
- Encoder: Processes mel-spectrogram features from audio input
- Decoder: Generates text tokens autoregressively
- Architecture: Transformer-based sequence-to-sequence model
- Model size: Small (244M parameters)
- Input: 80-dimensional log mel-spectrograms
- Output: Armenian text transcriptions
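To make the input format concrete, the processor converts raw 16 kHz audio into the 80-channel log-mel representation described above, padded to Whisper's fixed 30-second window:

```python
import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Chillarmo/whisper-small-armenian-v2")

# One second of silence at 16 kHz, just to inspect the feature shape
audio = np.zeros(16000, dtype=np.float32)
features = processor(audio, sampling_rate=16000, return_tensors="pt")["input_features"]
print(features.shape)  # torch.Size([1, 80, 3000]): 80 mel bins over a padded 30 s window
```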
Compute Infrastructure
Hardware
Training was performed on the following hardware configuration:
- GPU: 1x NVIDIA GeForce RTX 3060 Ti (8GB VRAM)
- CPU: Intel Core i7-10700F
- RAM: 32GB System Memory
- Operating System: Windows
- Training Environment: Local machine setup
Software
- Framework: Hugging Face Transformers
- Training library: PyTorch with Accelerate
- Audio processing: librosa, soundfile
- Evaluation: datasets, evaluate, jiwer
Citation
BibTeX:
```bibtex
@misc{movsesyan2025whisper-armenian-v2,
  author    = {Movsesyan, Movses},
  title     = {Whisper Small Armenian v2},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Chillarmo/whisper-small-armenian-v2}
}

@inproceedings{radford2022robust,
  title        = {Robust speech recognition via large-scale weak supervision},
  author       = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  booktitle    = {International Conference on Machine Learning},
  pages        = {28492--28518},
  year         = {2023},
  organization = {PMLR}
}
```
APA:
Movsesyan, M. (2025). Whisper Small Armenian v2. Hugging Face. https://huggingface.co/Chillarmo/whisper-small-armenian-v2
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492-28518). PMLR.
Model Card Authors
This model card was created by Movses Movsesyan based on the fine-tuning results and model performance data.