FastConformer-Hybrid-ARM-ASR

This model is a fine-tuned version of nvidia/stt_hy_fastconformer_hybrid_large_pc for Automatic Speech Recognition (ASR) in Armenian.

It was trained on the Mozilla Common Voice 20.0 dataset (hy-AM) using the NVIDIA NeMo toolkit.


Model Architecture

This model uses the FastConformer-Hybrid encoder, which combines:

  • Self-attention layers (as in Transformer models) for global context modeling
  • Convolutional modules for capturing local patterns efficiently

For decoding, the model uses:

  • Transducer (RNN-T) decoder — the main inference component
  • Auxiliary CTC loss — used only during training to improve alignment and convergence

During inference (transcribe()), only the Transducer decoder is used; the reported WER reflects its output.
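The auxiliary CTC loss enters training only as a weighted term alongside the Transducer loss. A minimal sketch of that combination, assuming NeMo's usual interpolation form and the ctc_loss_weight of 0.3 from this model's configuration (the loss values below are illustrative placeholders, not real batch losses):

```python
CTC_LOSS_WEIGHT = 0.3  # matches ctc_loss_weight: 0.3 in this model's config

def hybrid_loss(rnnt_loss: float, ctc_loss: float,
                ctc_weight: float = CTC_LOSS_WEIGHT) -> float:
    """Interpolate the main Transducer (RNN-T) loss with the auxiliary CTC loss."""
    return (1.0 - ctc_weight) * rnnt_loss + ctc_weight * ctc_loss

print(hybrid_loss(2.0, 1.0))  # 0.7 * 2.0 + 0.3 * 1.0 = 1.7
```

With ctc_weight set to 0, training would reduce to a pure Transducer objective; the auxiliary term only shapes the encoder during training and plays no role at inference time.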


Training Configuration

  • Base model: nvidia/stt_hy_fastconformer_hybrid_large_pc
  • Dataset: Common Voice 20.0 (hy-AM)
  • Epochs: 20
  • Batch size: 32 (train), 16 (val/test)
  • Audio: 16kHz mono WAVs
  • Tokenizer: BPE (Byte-Pair Encoding) — same as base model
  • Augmentation: SpecAugment
  • Loss: Transducer + auxiliary CTC (ctc_loss_weight: 0.3)
  • Optimizer: AdamW with cosine annealing
  • Precision: Mixed 16-bit (fp16)
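The cosine-annealing schedule decays the learning rate smoothly from its peak to a floor over training. A small sketch of the schedule's shape (the lr_max, lr_min, and total_steps values are illustrative, not taken from the actual config, and warmup is omitted for brevity):

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    """Learning rate at a given step under cosine annealing (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print(cosine_annealing_lr(0, 1000))     # peak, lr_max
print(cosine_annealing_lr(1000, 1000))  # floor, lr_min
```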

Evaluation

Evaluated on the Common Voice 20.0 Armenian (hy-AM) test split:

Decoder      WER (%)
Transducer   8.47

The model improves over the base model’s original WER of 9.90%, achieving a ~14% relative improvement.
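WER is word-level edit distance divided by the number of reference words. For spot-checking transcriptions locally, here is a small self-contained implementation (not the NeMo evaluation code used for the table above):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # 1 substitution / 3 words
```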


Files

File / Folder                       Description
fastconformer-hybrid-arm-asr.nemo   The fine-tuned ASR model checkpoint
config.yaml                         NeMo training configuration used to fine-tune
tokenizer/tokenizer.model           SentencePiece BPE tokenizer model
tokenizer/vocab.txt                 Vocabulary used for decoding
tokenizer/tokenizer.vocab           NeMo-compatible tokenizer vocabulary

Usage Example

from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

# Load the fine-tuned checkpoint
model = EncDecHybridRNNTCTCBPEModel.restore_from("fastconformer-hybrid-arm-asr.nemo")

# Transcribe one or more 16kHz mono WAV files
transcription = model.transcribe(["path_to_audio.wav"])
print(transcription[0])

The input audio must be a 16kHz mono WAV file. Other formats may result in degraded transcription quality or runtime errors.
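Before calling transcribe(), the format requirement can be verified with the standard-library wave module; a small helper along these lines (the function name is ours, not part of NeMo):

```python
import wave

def is_16khz_mono_wav(path: str) -> bool:
    """Return True if the file is a 16 kHz, single-channel PCM WAV."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getframerate() == 16000 and wav.getnchannels() == 1
    except (wave.Error, EOFError, FileNotFoundError):
        return False
```

Files that fail this check can be converted up front (e.g. with ffmpeg or sox) rather than passed to the model directly.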

Reproducibility

To fine-tune this model or adapt it to new datasets, you can reuse the included config.yaml. It defines:

  • Dataset loading – Manifest paths, sampling rate, bucketing, batch sizes
  • Model architecture – FastConformer encoder, RNNT decoder, joint module, auxiliary CTC decoder
  • Tokenizer setup – BPE tokenizer (tokenizer.model, vocab.txt, tokenizer.vocab)
  • Loss functions – Transducer (RNNT) as main loss + auxiliary CTC (ctc_loss_weight = 0.3)
  • Optimizer & scheduler – AdamW optimizer with cosine annealing scheduler
  • Logging & checkpointing – NeMo's exp_manager with optional checkpoint saving
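The manifest paths referenced in the config point at NeMo-style JSON-lines files, one utterance per line with an audio path, duration in seconds, and transcript. A sketch of building one (the file names, durations, and transcript text below are placeholders):

```python
import json

# Each manifest line describes one utterance. These entries are illustrative
# placeholders, not files shipped with this model.
utterances = [
    {"audio_filepath": "clips/sample_0001.wav", "duration": 3.2,
     "text": "placeholder transcript one"},
    {"audio_filepath": "clips/sample_0002.wav", "duration": 2.7,
     "text": "placeholder transcript two"},
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for entry in utterances:
        # ensure_ascii=False keeps Armenian text readable in the manifest
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

The resulting file can be plugged into the manifest_filepath fields of config.yaml for the train, validation, and test splits.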