Whisper Small Armenian v2: Enhanced Fine-tuning for Armenian Speech Recognition

This model is an enhanced fine-tuned version of Chillarmo/whisper-small-armenian on the Chillarmo/common_voice_20_armenian dataset. This v2 model incorporates additional training data and optimizations to achieve improved performance for Armenian automatic speech recognition tasks.

Model Details

Model Description

This is an enhanced fine-tuned Whisper model specifically optimized for Armenian speech recognition. The model builds upon a previously fine-tuned Whisper small model for Armenian and has been further trained with additional data to improve transcription accuracy and robustness for the Armenian language.

  • Developed by: Movses Movsesyan (Independent Research)
  • Model type: Automatic Speech Recognition
  • Language(s): Armenian (hy)
  • License: Apache 2.0
  • Finetuned from model: Chillarmo/whisper-small-armenian

Model Sources

  • Repository: https://huggingface.co/Chillarmo/whisper-small-armenian-v2

Uses

Direct Use

This model can be directly used for transcribing Armenian speech to text. It's particularly well-suited for:

  • Converting Armenian audio recordings to text
  • Real-time Armenian speech transcription
  • Building Armenian voice interfaces and applications
  • Research in Armenian computational linguistics
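
For the direct-use cases above, the simplest entry point is the Transformers pipeline API. The snippet below is a minimal sketch (the audio path is a placeholder); a lower-level example appears in the How to Get Started section.

from transformers import pipeline

# Build an ASR pipeline around this model
asr = pipeline(
    "automatic-speech-recognition",
    model="Chillarmo/whisper-small-armenian-v2",
)

# The pipeline decodes and resamples common audio formats automatically
result = asr("path/to/armenian_audio.wav")
print(result["text"])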

Downstream Use

The model can be integrated into larger applications such as:

  • Voice assistants for Armenian speakers
  • Subtitle generation for Armenian media content
  • Accessibility tools for Armenian-speaking communities
  • Educational applications for Armenian language learning
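
For subtitle-style workflows in particular, the pipeline's long-form chunking can return per-segment timestamps. This is a hedged sketch: the file path is a placeholder, and the 30-second chunk length is a conventional default rather than a tuned value.

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Chillarmo/whisper-small-armenian-v2",
    chunk_length_s=30,       # process long audio in 30-second windows
    return_timestamps=True,  # emit start/end times per segment
)

output = asr("path/to/armenian_media.wav")
for chunk in output["chunks"]:
    # Each chunk carries a (start, end) timestamp pair and its text
    print(chunk["timestamp"], chunk["text"])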

Out-of-Scope Use

This model should not be used for:

  • Speech recognition in languages other than Armenian
  • Speaker identification or verification
  • Audio classification beyond speech transcription
  • Medical or legal transcription requiring 100% accuracy

Bias, Risks, and Limitations

The model may have limitations including:

  • Domain bias: Performance may vary significantly across different speaking styles, accents, and audio quality
  • Vocabulary limitations: May struggle with technical terms, proper nouns, or words not present in the training data
  • Audio quality dependency: Performance degrades with poor audio quality, background noise, or multiple speakers
  • Dialectal variations: May show bias toward specific Armenian dialects represented in the training data

Recommendations

Users should be aware of these limitations and:

  • Test the model thoroughly on their specific use case and domain
  • Implement appropriate error handling for critical applications
  • Consider human review for high-stakes transcription tasks
  • Be mindful of potential biases when deploying in diverse linguistic contexts
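
One concrete way to route high-stakes transcriptions to human review is to flag low-confidence outputs. The sketch below is an illustration, not part of the released code: transcribe_with_review_flag and its threshold are hypothetical, and the mean log-probability of the generated tokens is only a rough confidence proxy that should be tuned on held-out data.

import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("Chillarmo/whisper-small-armenian-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Chillarmo/whisper-small-armenian-v2")

def transcribe_with_review_flag(input_features, threshold=-1.0):
    # `threshold` is an illustrative cutoff, not a validated value
    with torch.no_grad():
        out = model.generate(
            input_features,
            output_scores=True,
            return_dict_in_generate=True,
        )
    # Mean per-token log-probability as a rough confidence estimate
    scores = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    avg_logprob = scores.mean().item()
    text = processor.batch_decode(out.sequences, skip_special_tokens=True)[0]
    return text, avg_logprob < threshold  # True means "send to human review"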

How to Get Started with the Model

Use the code below to get started with the model:

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load the processor and model
processor = AutoProcessor.from_pretrained("Chillarmo/whisper-small-armenian-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Chillarmo/whisper-small-armenian-v2")

def transcribe_armenian(audio_path):
    # Load the audio file and resample it to the 16 kHz rate Whisper expects
    audio, sr = librosa.load(audio_path, sr=16000)

    # Convert the waveform to log mel-spectrogram input features
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

    # Generate the transcription token ids
    with torch.no_grad():
        predicted_ids = model.generate(inputs["input_features"])

    # Decode the token ids to text
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]

# Example usage
# transcription = transcribe_armenian("path/to/armenian_audio.wav")
# print(transcription)
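
Depending on your Transformers version, you may also be able to pin decoding to Armenian transcription explicitly, e.g. model.generate(inputs["input_features"], language="hy", task="transcribe"). This is a version-dependent suggestion rather than part of the example above, but it can keep generation from drifting into the wrong language or into translation mode on ambiguous audio.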

Training Details

Training Data

The model was fine-tuned on the Chillarmo/common_voice_20_armenian dataset with additional training data incorporated to enhance performance and robustness. This v2 version represents an iterative improvement over the base fine-tuned model, with expanded training data to better capture Armenian speech patterns and vocabulary.
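
For reference, the dataset named above can be inspected with the Hugging Face datasets library. This is a minimal sketch; the split and column names are assumptions in the style of Common Voice datasets rather than documented facts.

from datasets import load_dataset

ds = load_dataset("Chillarmo/common_voice_20_armenian")
print(ds)                 # available splits and their sizes
example = ds["train"][0]  # the "train" split name is an assumption
print(example.keys())     # expect audio and transcription columns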

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training:

  • Training regime: Mixed precision training
  • Epochs: 5.24
  • Training runtime: 44,426 seconds (approximately 12.3 hours)
  • Training samples per second: 1.801
  • Training steps per second: 0.113
  • Final training loss: 0.076
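
Taken together, these throughput figures imply an effective batch size of roughly 1.801 samples/s ÷ 0.113 steps/s ≈ 16 samples per optimizer step. This is an inference from the reported numbers, not a logged hyperparameter.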

Speeds, Sizes, Times

  • Training time: ~12.3 hours for 5,000 training steps
  • Evaluation time: ~2.6 hours
  • Evaluation samples per second: 0.624
  • Total training steps: 5,000

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on a held-out test set from the Chillarmo/common_voice_20_armenian dataset.

Metrics

The model was evaluated using standard speech recognition metrics:

  • Word Error Rate (WER): Measures the percentage of words that are incorrectly transcribed
  • Character Error Rate (CER): Measures the percentage of characters that are incorrectly transcribed
  • Exact Match: Percentage of utterances that are transcribed perfectly
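
These metrics can be reproduced with jiwer, the evaluation library listed in the Software section. A minimal sketch with illustrative strings:

import jiwer

reference = "բարև ձեզ"   # ground-truth transcription
hypothesis = "բարև ձես"  # model output with one character error

print("WER:", jiwer.wer(reference, hypothesis))  # word error rate
print("CER:", jiwer.cer(reference, hypothesis))  # character error rate
print("Exact match:", reference == hypothesis)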

Results

The fine-tuned model achieved the following performance on the evaluation set:

| Metric                     | Value       |
|----------------------------|-------------|
| Word Error Rate (WER)      | 24.01%      |
| Character Error Rate (CER) | 4.77%       |
| Exact Match                | 28.14%      |
| Average Prediction Length  | 7.74 tokens |
| Average Label Length       | 7.77 tokens |
| Length Ratio               | 0.995       |

Summary

The model demonstrates strong performance for Armenian speech recognition, with a character error rate of 4.77% and a word error rate of 24.01%. The length ratio close to 1.0 indicates that the model generates transcriptions of appropriate length relative to the ground truth.

Technical Specifications

Model Architecture and Objective

This model is based on the Whisper architecture, which uses a Transformer encoder-decoder structure:

  • Encoder: Processes mel-spectrogram features from audio input
  • Decoder: Generates text tokens autoregressively
  • Architecture: Transformer-based sequence-to-sequence model
  • Model size: Small (244M parameters)
  • Input: 80-dimensional log mel-spectrograms
  • Output: Armenian text transcriptions
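
The input format can be verified directly: the Whisper feature extractor pads or truncates audio to 30 seconds and emits an 80-bin log mel-spectrogram, so the encoder input shape is (batch, 80, 3000). A quick check:

import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Chillarmo/whisper-small-armenian-v2")
audio = np.zeros(16000)  # one second of silence at 16 kHz
features = processor(audio, sampling_rate=16000, return_tensors="pt")
print(features.input_features.shape)  # expected: torch.Size([1, 80, 3000])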

Compute Infrastructure

Hardware

Training was performed on the following hardware configuration:

  • GPU: 1x NVIDIA GeForce RTX 3060 Ti (8GB VRAM)
  • CPU: Intel Core i7-10700F
  • RAM: 32GB System Memory
  • Operating System: Windows
  • Training Environment: Local machine setup

Software

  • Framework: Hugging Face Transformers
  • Training library: PyTorch with Accelerate
  • Audio processing: librosa, soundfile
  • Evaluation: datasets, evaluate, jiwer
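
A matching environment can be set up with, for example, pip install transformers accelerate torch librosa soundfile datasets evaluate jiwer (exact versions were not reported, so none are pinned here).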

Citation

BibTeX:

@misc{movsesyan2025whisper-armenian-v2,
  author = {Movsesyan, Movses},
  title = {Whisper Small Armenian v2},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Chillarmo/whisper-small-armenian-v2}
}

@inproceedings{radford2023robust,
  title={Robust speech recognition via large-scale weak supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  booktitle={International Conference on Machine Learning},
  pages={28492--28518},
  year={2023},
  organization={PMLR}
}

APA:

Movsesyan, M. (2025). Whisper Small Armenian v2. Hugging Face. https://huggingface.co/Chillarmo/whisper-small-armenian-v2

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492-28518). PMLR.

Model Card Authors

This model card was created by Movses Movsesyan based on the fine-tuning results and model performance data.
