---
license: mit
language:
- ru
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- asr
- gigaam
- stt
- ru
- ctc
- ngram
- audio
- speech
---
[Open in Colab](https://colab.research.google.com/gist/waveletdeboshir/07e39ae96f27331aa3e1e053c2c2f9e8/gigaam-ctc-hf-with-lm.ipynb)
# GigaAM-v2-CTC with ngram LM and beamsearch 🤗 Hugging Face transformers
* Original repository: https://github.com/salute-developers/GigaAM
* n-gram LM taken from [`bond005/wav2vec2-large-ru-golos-with-lm`](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm)

Russian ASR model GigaAM-v2-CTC with an external n-gram LM and beam search decoding.
## Model info
This is the original GigaAM-v2-CTC model wrapped in a `transformers` library interface, with beam search decoding and hypothesis rescoring by an external n-gram LM.
It can also be used to extract word-level timestamps.
The file [`gigaam_transformers.py`](https://huggingface.co/waveletdeboshir/gigaam-ctc-with-lm/blob/main/gigaam_transformers.py) contains the model, feature extractor and tokenizer classes with the usual `transformers` methods. The model can be initialized with the `transformers` auto classes (see the example below).
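
For readers new to CTC, here is a minimal sketch of plain greedy CTC decoding, the baseline that beam search with an LM improves on. It is not the decoding used by this model's processor. The vocabulary, the blank index and the function name `ctc_greedy_decode` are made up for illustration and do not match the real tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, vocab, blank_id):
    """Greedy CTC decoding: merge repeated frame labels, then drop blanks."""
    # Consecutive identical labels belong to the same emitted token
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Blanks only separate tokens, they are never part of the output
    return "".join(vocab[i] for i in collapsed if i != blank_id)


# Toy vocabulary (assumption, not the model's tokenizer); index 0 is the blank
vocab = ["<blank>", "t", "o", " "]
# Per-frame argmax of the logits; the blank between the two "o" runs keeps both letters
frame_ids = [1, 1, 0, 2, 2, 0, 2]
print(ctc_greedy_decode(frame_ids, vocab, blank_id=0))  # too
```

Greedy decoding keeps only the single best label per frame. Beam search keeps several candidate transcriptions alive, which is what lets an external LM influence the final choice.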
## Installation
Tested with the following library versions:
* `torch` 2.5.1
* `torchaudio` 2.5.1
* `transformers` 4.49.0
You also need to install `kenlm` and `pyctcdecode`:
```bash
pip install kenlm
pip install pyctcdecode
```
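
To compare your environment with the versions listed above, a small check like the following may help. It only uses the standard library; the helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if it is missing}."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


required = ["torch", "torchaudio", "transformers", "kenlm", "pyctcdecode"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'not installed'}")
```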
## Usage
Usage is the same as for other `transformers` ASR models.
```python
from transformers import AutoModel, AutoProcessor
import torch
import torchaudio
# load audio
wav, sr = torchaudio.load("audio.wav")
# resample if necessary
wav = torchaudio.functional.resample(wav, sr, 16000)
# load model and processor
processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model.eval()
input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")
# predict
with torch.no_grad():
    logits = model(**input_features).logits
# decoding with beam search and LM (tune alpha, beta, beam_width for your data)
transcription = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
).text[0]
```
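
To build some intuition for `alpha` and `beta`, here is a schematic sketch of how an LM weight and a word bonus can change the ranking of hypotheses. It assumes the usual shallow-fusion form `acoustic + alpha * lm + beta * n_words`. `pyctcdecode` applies these weights incrementally during the beam search, and its internals may differ in details such as the log base. The `Hypothesis` class and all scores below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_logp: float  # total CTC log-probability of the hypothesis
    lm_logp: float        # log-probability assigned by the n-gram LM


def fused_score(hyp, alpha, beta):
    """Acoustic score plus weighted LM score plus a per-word bonus."""
    n_words = len(hyp.text.split())
    return hyp.acoustic_logp + alpha * hyp.lm_logp + beta * n_words


def best_hypothesis(hyps, alpha, beta):
    return max(hyps, key=lambda h: fused_score(h, alpha, beta))


# Made-up scores: the second hypothesis sounds slightly better but is unlikely text
hyps = [
    Hypothesis("recognize speech", acoustic_logp=-4.2, lm_logp=-3.0),
    Hypothesis("wreck a nice beach", acoustic_logp=-4.0, lm_logp=-9.0),
]
print(best_hypothesis(hyps, alpha=0.0, beta=0.0).text)  # wreck a nice beach
print(best_hypothesis(hyps, alpha=0.5, beta=0.5).text)  # recognize speech
```

A larger `alpha` trusts the LM more, and a larger `beta` favors hypotheses with more words. Good values depend on the data, so it is worth tuning them on a held-out set.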
### Decoding with timestamps
The decoder can also return word-level timestamps. To get them, pass `output_word_offsets=True` and convert the returned frame offsets to time using the model stride.
For this model (Conformer) the stride is `MODEL_STRIDE = 40` ms per output frame.
```python
MODEL_STRIDE = 40  # ms per output frame
outputs = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
    output_word_offsets=True
)
# convert frame offsets to seconds
word_ts = [
    {
        "word": d["word"],
        "start": round(d["start_offset"] * MODEL_STRIDE / 1000, 2),
        "end": round(d["end_offset"] * MODEL_STRIDE / 1000, 2),
    }
    for d in outputs.word_offsets[0]
]
```
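
As one possible use of these timestamps, the sketch below groups words into segments and starts a new segment whenever the pause between two words is longer than a threshold. The helper `split_on_pauses`, the `max_gap` default and the sample data are assumptions for illustration. The input has the same shape as `word_ts` above.

```python
def split_on_pauses(word_ts, max_gap=0.5):
    """Group consecutive words into segments.

    A new segment starts when the silence between the end of the previous
    word and the start of the next one exceeds max_gap seconds.
    """
    segments = []
    for w in word_ts:
        if segments and w["start"] - segments[-1]["end"] <= max_gap:
            # Short pause: extend the current segment
            segments[-1]["text"] += " " + w["word"]
            segments[-1]["end"] = w["end"]
        else:
            # First word or long pause: open a new segment
            segments.append({"text": w["word"], "start": w["start"], "end": w["end"]})
    return segments


# Made-up timestamps in seconds, same keys as `word_ts`
sample_ts = [
    {"word": "hello", "start": 0.2, "end": 0.56},
    {"word": "world", "start": 0.64, "end": 0.92},
    {"word": "bye", "start": 2.0, "end": 2.4},
]
for seg in split_on_pauses(sample_ts):
    print(f'{seg["start"]:.2f}-{seg["end"]:.2f}: {seg["text"]}')
```

With the sample data this yields two segments, "hello world" and "bye", because the gap before "bye" is longer than 0.5 s.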