# FastConformer-Hybrid-ARM-ASR
This model is a fine-tuned version of `nvidia/stt_hy_fastconformer_hybrid_large_pc` for Automatic Speech Recognition (ASR) in Armenian. It was trained on the Mozilla Common Voice 20.0 dataset (`hy-AM`) using the NVIDIA NeMo toolkit.
## Model Architecture
This model uses the FastConformer encoder, which combines:

- Self-attention layers (as in Transformers) for global context modeling
- Convolutional modules for capturing local patterns efficiently

For decoding, the model uses:

- Transducer (RNN-T) decoder: the main inference component
- Auxiliary CTC decoder: used only during training to improve alignment and convergence

During inference (`transcribe()`), only the Transducer decoder is used, and its performance is what defines the model's WER.
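As a minimal sketch of how the two objectives interact during training (illustrative only, not NeMo's actual code; the interpolation form is an assumption about how `ctc_loss_weight` is applied):

```python
# Sketch of a hybrid Transducer + CTC training objective.
# Assumption: the weight interpolates between the two losses,
# so ctc_loss_weight = 0.3 means 70% RNN-T loss and 30% CTC loss.

def hybrid_loss(rnnt_loss: float, ctc_loss: float, ctc_loss_weight: float = 0.3) -> float:
    """Combine the main Transducer loss with the auxiliary CTC loss."""
    return (1.0 - ctc_loss_weight) * rnnt_loss + ctc_loss_weight * ctc_loss

# Example: a batch with RNN-T loss 2.0 and CTC loss 1.0
total = hybrid_loss(2.0, 1.0)  # 0.7 * 2.0 + 0.3 * 1.0 = 1.7
```

At inference time only the Transducer branch contributes; the CTC term above exists purely to shape training.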
## Training Configuration

- Base model: `nvidia/stt_hy_fastconformer_hybrid_large_pc`
- Dataset: Common Voice 20.0 (`hy-AM`)
- Epochs: 20
- Batch size: 32 (train), 16 (val/test)
- Audio: 16 kHz mono WAV
- Tokenizer: BPE (Byte-Pair Encoding), same as the base model
- Augmentation: SpecAugment
- Loss: Transducer + auxiliary CTC (`ctc_loss_weight: 0.3`)
- Optimizer: AdamW with cosine annealing
- Precision: Mixed 16-bit (fp16)
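SpecAugment regularizes training by zeroing random stripes of time steps and frequency bins in the spectrogram. A minimal NumPy sketch of the idea (illustrative only; NeMo applies its own implementation with the mask counts and widths set in the config):

```python
import numpy as np

def spec_augment(spec, n_time_masks=2, time_width=10,
                 n_freq_masks=2, freq_width=8, rng=None):
    """Zero out random time and frequency stripes of a (freq, time) spectrogram."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_time_masks):
        w = int(rng.integers(0, time_width + 1))          # random mask width
        t0 = int(rng.integers(0, max(1, n_time - w)))     # random start frame
        out[:, t0:t0 + w] = 0.0
    for _ in range(n_freq_masks):
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w)))
        out[f0:f0 + w, :] = 0.0
    return out

# Example: augment a random 80-bin mel spectrogram of 200 frames
aug = spec_augment(np.random.rand(80, 200), rng=np.random.default_rng(0))
```

Because the masks are resampled every step, the model never sees the same corrupted view twice, which discourages over-reliance on any single time span or frequency band.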
## Evaluation

Evaluated on the Common Voice 20.0 Armenian (`hy-AM`) test split:

| Decoder Used | WER (%) |
|---|---|
| Transducer | 8.47 |

The model improves on the base model's original WER of 9.90%, a ~14% relative improvement.
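The relative improvement follows directly from the two WER figures:

```python
# Relative WER improvement of the fine-tuned model over the base model
base_wer = 9.90
finetuned_wer = 8.47

relative_improvement = (base_wer - finetuned_wer) / base_wer * 100
print(f"{relative_improvement:.1f}%")  # 14.4%
```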
## Files

| File / Folder | Description |
|---|---|
| `fastconformer-hybrid-arm-asr.nemo` | The fine-tuned ASR model checkpoint |
| `config.yaml` | NeMo training configuration used for fine-tuning |
| `tokenizer/tokenizer.model` | SentencePiece BPE tokenizer model |
| `tokenizer/vocab.txt` | Vocabulary used for decoding |
| `tokenizer/tokenizer.vocab` | NeMo-compatible tokenizer vocabulary |
## Usage Example

```python
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

# Load the fine-tuned checkpoint
model = EncDecHybridRNNTCTCBPEModel.restore_from("fastconformer-hybrid-arm-asr.nemo")

# Transcribe one or more audio files (paths to 16 kHz mono WAVs)
transcription = model.transcribe(["path_to_audio.wav"])
print(transcription[0])
```

The input audio must be a 16 kHz mono WAV file; other formats may degrade transcription quality or cause runtime errors.
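To catch format problems before they reach the model, the standard-library `wave` module can inspect a file's header. A simple pre-flight check (this helper is not part of NeMo):

```python
import wave

def check_wav(path: str, expected_rate: int = 16000, expected_channels: int = 1) -> None:
    """Raise ValueError unless the WAV file at `path` is 16 kHz mono."""
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != expected_rate:
            raise ValueError(f"Expected {expected_rate} Hz, got {wf.getframerate()} Hz")
        if wf.getnchannels() != expected_channels:
            raise ValueError(f"Expected {expected_channels} channel(s), got {wf.getnchannels()}")

# Call before model.transcribe([...]):
# check_wav("path_to_audio.wav")
```

Files that fail the check can be resampled or downmixed with any audio tool (e.g. ffmpeg) before transcription.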
## Reproducibility

To fine-tune this model or adapt it to new datasets, reuse the included `config.yaml`. It defines:

- Dataset loading: manifest paths, sampling rate, bucketing, batch sizes
- Model architecture: FastConformer encoder, RNNT decoder, joint module, auxiliary CTC decoder
- Tokenizer setup: BPE tokenizer (`tokenizer.model`, `vocab.txt`, `tokenizer.vocab`)
- Loss functions: Transducer (RNNT) as the main loss plus auxiliary CTC (`ctc_loss_weight = 0.3`)
- Optimizer & scheduler: AdamW with cosine annealing
- Logging & checkpointing: NeMo's `exp_manager` with optional checkpoint saving
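Cosine annealing decays the learning rate from its peak toward a minimum along a half cosine over training. A minimal sketch of the schedule (illustrative only; NeMo's CosineAnnealing scheduler also handles warmup and other options configured in `config.yaml`):

```python
import math

def cosine_annealing_lr(step: int, max_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Learning rate at `step` under plain cosine annealing (no warmup)."""
    progress = min(step, max_steps) / max_steps          # 0.0 at start, 1.0 at end
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# The rate starts at lr_max and falls smoothly to lr_min at max_steps
print(cosine_annealing_lr(0, 1000, 1e-3))    # 0.001
print(cosine_annealing_lr(500, 1000, 1e-3))  # halfway: 0.0005
```

The smooth decay avoids the abrupt drops of step schedules, which tends to help convergence on small fine-tuning datasets.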