# wav2vec2-xlsr-53-ft-cy-en-withlm
An acoustic encoder model for Welsh and English speech recognition, accompanied by an n-gram language model. The acoustic model is fine-tuned from `facebook/wav2vec2-large-xlsr-53` using transcribed spontaneous speech from `techiaith/banc-trawsgrifiadau-bangor` (v24.01) and Welsh and English speech data derived from version 16.1 of the Common Voice datasets (`techiaith/commonvoice_16_1_en_cy`).

The accompanying language model is a single KenLM n-gram model trained on a balanced collection of Welsh and English texts from OSCAR, which avoids language-specific models and language detection during CTC decoding.
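Under the hood, `Wav2Vec2ProcessorWithLM` wraps a `pyctcdecode` beam-search decoder that scores hypotheses with the KenLM model. As a rough sketch of how such a decoder is assembled (the label set and `lm.arpa` path below are hypothetical placeholders, not files from this repository):

```python
from pyctcdecode import build_ctcdecoder

# Hypothetical CTC label set; the real one comes from the model's tokenizer.
vocab = ["<pad>", " ", "a", "b", "c"]

decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="lm.arpa",  # assumed path to the trained bilingual n-gram model
)
# decoder.decode(logits) then rescores beams with the single bilingual LM,
# so no per-language model selection is needed during decoding.
```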
## Usage
The `wav2vec2-xlsr-53-ft-cy-en-withlm` model can be used directly as follows:
```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

# Load the audio at the 16 kHz sampling rate the model expects.
audio, rate = librosa.load(<path/to/audio_file>, sr=16000)

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Beam-search CTC decoding with the bundled KenLM language model.
print("Prediction:", processor.batch_decode(logits.numpy(), beam_width=10).text[0].strip())
```
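Note that `librosa.load(..., sr=16000)` already resamples to the 16 kHz rate the model expects. If you load audio with `torchaudio` instead, resample explicitly; a minimal sketch, assuming a hypothetical 44.1 kHz input file:

```python
import torchaudio

# torchaudio.load returns a (channels, samples) tensor and the native sample rate.
waveform, rate = torchaudio.load("speech_44k.wav")  # hypothetical file
if rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=rate, new_freq=16_000)
audio = waveform.mean(dim=0).numpy()  # downmix to a mono 1-D array for the processor
```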
Usage with a pipeline is even simpler...
```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

def transcribe(audio):
    return transcriber(audio)["text"]

transcribe(<path/or/url/to/any/audiofile>)
```
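For long recordings, the pipeline can transcribe in overlapping fixed-length chunks. A minimal sketch using the pipeline's `chunk_length_s` and `stride_length_s` parameters; the values below are illustrative choices, not tuned settings:

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm",
    chunk_length_s=30,       # illustrative chunk size in seconds
    stride_length_s=(4, 2),  # left/right overlap to smooth chunk boundaries
)
```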