GigaAM-v2-CTC with ngram LM and beamsearch 🤗 Hugging Face transformers

This is an unofficial Transformers wrapper for the original GigaAM model released by SberDevices.

Original repository: https://github.com/salute-developers/GigaAM
The n-gram LM is taken from bond005/wav2vec2-large-ru-golos-with-lm

Russian ASR model GigaAM-v2-CTC with external ngram LM and beamsearch decoding.

Model info

This is GigaAM-v2-CTC with a transformers library interface, beam search decoding and hypothesis rescoring with an external n-gram LM. In addition, it can be used to extract word-level timestamps.

The file gigaam_transformers.py contains the model, feature extractor and tokenizer classes with the usual transformers methods. The model can be initialized with the transformers auto classes (see the example below).

Installation

Library versions I used:

torch 2.7.1
torchaudio 2.7.1
transformers 4.49.0

You need to install kenlm and pyctcdecode:

pip install kenlm
pip install pyctcdecode

Usage

Usage is the same as for other transformers ASR models.

from transformers import AutoModel, AutoProcessor
import torch
import torchaudio

# load audio
wav, sr = torchaudio.load("audio.wav")
# resample if necessary
wav = torchaudio.functional.resample(wav, sr, 16000)

# load model and processor
processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model.eval()

input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")

# predict
with torch.no_grad():
    logits = model(**input_features).logits

# decoding with beam search and LM (tune alpha, beta, beam_width for your data)
transcription = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
).text[0]
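
The comment above suggests tuning alpha and beta for your data. One possible way to do this is a small grid search that picks the pair with the lowest word error rate on a labeled development set. The sketch below is only an illustration: `word_error_rate`, `grid_search` and `decode_fn` are hypothetical names that are not part of this repository, and the score is the mean per-utterance WER rather than a corpus-level WER.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def grid_search(
    decode_fn: Callable[[float, float], List[str]],  # hypothetical: (alpha, beta) -> transcripts
    references: Sequence[str],
    alphas: Sequence[float],
    betas: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (mean WER, alpha, beta) for the best pair on the grid."""
    best = None
    for alpha, beta in product(alphas, betas):
        hypotheses = decode_fn(alpha, beta)
        wer = sum(
            word_error_rate(r, h) for r, h in zip(references, hypotheses)
        ) / len(references)
        if best is None or wer < best[0]:
            best = (wer, alpha, beta)
    return best
```

With the objects from the usage example, `decode_fn` could be something like `lambda a, b: processor.batch_decode(logits=logits.numpy(), beam_width=64, alpha=a, beta=b).text`, so the model forward pass runs once and only the decoding is repeated.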

Decoding with timestamps

The decoder can also extract word-level timestamps. To do this, set the parameter output_word_offsets=True and convert the returned frame offsets to time using the model stride.

For this model (Conformer), MODEL_STRIDE = 40 ms per output frame.

MODEL_STRIDE = 40
outputs = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
    output_word_offsets=True
)
word_ts = [
    {
        "word": d["word"],
        "start": round(d["start_offset"] * MODEL_STRIDE / 1000, 2),
        "end": round(d["end_offset"] * MODEL_STRIDE / 1000, 2),
    }
    for d in outputs.word_offsets[0]
]
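
The 40 ms stride used above can be sanity-checked by dividing the audio duration by the number of logit frames the model produced. This is a minimal sketch: `estimate_stride_ms` is a hypothetical helper, and the result is only approximate because the frame count is rounded by the feature extractor and the encoder.

```python
def estimate_stride_ms(num_samples: int, sample_rate: int, num_frames: int) -> float:
    """Approximate duration in milliseconds covered by one logit frame."""
    duration_ms = 1000.0 * num_samples / sample_rate
    return duration_ms / num_frames


# Made-up numbers for illustration: 2 s of 16 kHz audio yielding 50 logit frames
print(estimate_stride_ms(num_samples=32000, sample_rate=16000, num_frames=50))  # 40.0
```

With the tensors from the usage example this would be `estimate_stride_ms(wav.shape[-1], 16000, logits.shape[1])`, assuming the logits have shape (batch, frames, vocab).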
