SWRA (SWARA)
SWRA (SWARA) is a Speech to Text Transformer (S2T) model trained by @binarybardakshat for automatic speech recognition (ASR).
## Model Description
SWRA (SWARA) is an end-to-end sequence-to-sequence transformer model. It is trained with standard autoregressive cross-entropy loss and generates the transcripts autoregressively.
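The two ideas in this description, teacher-forced cross-entropy during training and token-by-token generation at inference, can be illustrated with a small pure-Python sketch. Everything here is hypothetical and for illustration only: `toy_step` stands in for the decoder (a real model would also condition on the audio features), and the three-token vocabulary is invented. It is not the model's actual training code.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: maps a token prefix to a distribution over the next token.
NextTokenDist = Callable[[Sequence[str]], Dict[str, float]]

BOS, EOS = "<s>", "</s>"


def cross_entropy(step_fn: NextTokenDist, target: List[str]) -> float:
    """Teacher-forced loss: mean negative log-probability of each target
    token (plus EOS) given the gold prefix that precedes it."""
    prefix = [BOS]
    total = 0.0
    for token in target + [EOS]:
        probs = step_fn(prefix)
        total += -math.log(probs[token])
        prefix.append(token)  # feed the gold token, not the model's own guess
    return total / (len(target) + 1)


def greedy_decode(step_fn: NextTokenDist, max_len: int = 20) -> List[str]:
    """Autoregressive generation: repeatedly pick the most likely next
    token and append it to the prefix until EOS or max_len."""
    prefix = [BOS]
    for _ in range(max_len):
        probs = step_fn(prefix)
        token = max(probs, key=probs.get)
        if token == EOS:
            break
        prefix.append(token)
    return prefix[1:]


def toy_step(prefix: Sequence[str]) -> Dict[str, float]:
    """Toy stand-in for the decoder: prefers "hello", then "world", then EOS."""
    script = ["hello", "world", EOS]
    position = len(prefix) - 1
    best = script[min(position, len(script) - 1)]
    return {tok: (0.8 if tok == best else 0.1) for tok in script}


if __name__ == "__main__":
    print(greedy_decode(toy_step))
    print(cross_entropy(toy_step, ["hello", "world"]))
```

In the real model, `generate` plays the role of `greedy_decode` (it also supports other decoding strategies such as beam search), and the per-step distributions come from the transformer decoder attending to the encoded speech features.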
### How to Use
As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate the transcripts by passing the speech features to the model.
*Note: The `Speech2TextProcessor` object uses torchaudio to extract the filter bank features, and the tokenizer depends on sentencepiece, so install both packages before running the examples.*

You can either install them as extra speech dependencies with `pip install "transformers[speech,sentencepiece]"` or install the packages separately with `pip install torchaudio sentencepiece`.
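A quick way to confirm that both dependencies are importable before loading the model is sketched below. The helper name `missing_packages` is hypothetical; it only uses the standard library's `importlib.util.find_spec` and does not import the packages themselves.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found on this system."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["torchaudio", "sentencepiece"])
    if missing:
        print("Install these first: pip install " + " ".join(missing))
    else:
        print("All speech dependencies are available.")
```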
```python
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset

model = Speech2TextForConditionalGeneration.from_pretrained("binarybardakshat/swra-swara")
processor = Speech2TextProcessor.from_pretrained("binarybardakshat/swra-swara")

# Small dummy split of LibriSpeech, used here only as sample audio
ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy",
    "clean",
    split="validation"
)

# Extract filter bank features from the raw 16 kHz waveform
input_features = processor(
    ds[0]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt"
).input_features  # Batch size 1

generated_ids = model.generate(input_features=input_features)

# Decode token ids to text, dropping special tokens such as </s>
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
```
#### Evaluation on LibriSpeech Test
The following script shows how to evaluate this model on the *"clean"* and *"other"* test sets of [LibriSpeech](https://huggingface.co/datasets/librispeech_asr).
```python
from datasets import load_dataset
from evaluate import load
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")  # change to "other" for other test dataset
wer = load("wer")

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").to("cuda")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr", do_upper_case=True)

def map_to_pred(batch):
    features = processor(batch["audio"]["array"], sampling_rate=16000, padding=True, return_tensors="pt")
    input_features = features.input_features.to("cuda")
    attention_mask = features.attention_mask.to("cuda")

    gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
    batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)[0]
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))
```
Result (WER, in %):

| "clean" | "other" |
|---|---|
| 4.3 | 9.0 |
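For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a minimal pure-Python version, assuming corpus-level aggregation (edits and words summed over all utterances before dividing). The function names are hypothetical and this is not the `evaluate` library's implementation.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references: List[str], predictions: List[str]) -> float:
    """Total word edits divided by total reference words, as a fraction."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_edits / total_words


if __name__ == "__main__":
    refs = ["THE CAT SAT ON THE MAT"]
    hyps = ["THE CAT SIT ON MAT"]
    print(f"WER: {100 * word_error_rate(refs, hyps):.1f}%")
```

The comparison is case-sensitive, which is why the evaluation script passes `do_upper_case=True`: LibriSpeech reference transcripts are upper-case. The function returns a fraction, so multiply by 100 to compare with the percentages in the table.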
## Training data
S2T-SMALL-LIBRISPEECH-ASR is trained on the LibriSpeech ASR Corpus, a dataset consisting of approximately 1,000 hours of 16 kHz read English speech.