facebook/wav2vec2-large-960h

sanchit-gandhi (HF staff) committed on Sep 19, 2023

Commit 934c622 • 1 Parent(s): bdeaacd

Update README.md

Files changed (1)

README.md +20 -10

@@ -55,15 +55,22 @@ To transcribe audio files the model can be used as a standalone acoustic model a
 ## Evaluation
-This code snippet shows how to evaluate **facebook/wav2vec2-large-960h** on LibriSpeech's "clean" and "other" test data.
 ```python
 from datasets import load_dataset
 from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
-import soundfile as sf
-import torch
-from jiwer import wer
 librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
@@ -71,18 +78,21 @@ model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h").to("cuda"
 processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
 def map_to_pred(batch):
-    input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
     with torch.no_grad():
         logits = model(input_values.to("cuda")).logits
     predicted_ids = torch.argmax(logits, dim=-1)
     transcription = processor.batch_decode(predicted_ids)
-    batch["transcription"] = transcription
     return batch
-result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])
-print("WER:", wer(result["text"], result["transcription"]))
 ```
 *Result (WER)*:

 ## Evaluation
+First, ensure the required Python packages are installed. We'll require `transformers` for running the Wav2Vec2 model,
+`datasets` for loading the LibriSpeech dataset, and `evaluate` plus `jiwer` for computing the word-error rate (WER):
+```
+pip install --upgrade pip
+pip install --upgrade transformers datasets evaluate jiwer
+```
+The following code snippet shows how to evaluate **facebook/wav2vec2-large-960h** on LibriSpeech's "clean" and "other" test data.
+The batch size can be set according to your device, and is set to `8` by default:
 ```python
+import torch
 from datasets import load_dataset
 from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+from evaluate import load
 librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
 processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
 def map_to_pred(batch):
+    audios = [audio["array"] for audio in batch["audio"]]
+    sampling_rate = batch["audio"][0]["sampling_rate"]
+    input_values = processor(audios, sampling_rate=sampling_rate, return_tensors="pt", padding="longest").input_values
     with torch.no_grad():
         logits = model(input_values.to("cuda")).logits
     predicted_ids = torch.argmax(logits, dim=-1)
     transcription = processor.batch_decode(predicted_ids)
+    batch["transcription"] = [t for t in transcription]
     return batch
+result = librispeech_eval.map(map_to_pred, batched=True, batch_size=8, remove_columns=["audio"])
+wer = load("wer")
+print("WER:", wer.compute(references=result["text"], predictions=result["transcription"]))
 ```
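For readers unfamiliar with the metric reported by `wer.compute` above, the sketch below shows the usual definition of word-error rate: the word-level edit distance (substitutions, deletions, insertions) summed over all utterances, divided by the total number of reference words. It is a minimal pure-Python illustration, not the `jiwer` or `evaluate` implementation, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words.

    Illustrative only; assumes whitespace tokenization and no text normalization.
    """
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Word-level Levenshtein distance, keeping only the previous row.
        prev = list(range(len(hyp_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, h in enumerate(hyp_words, start=1):
                cost = 0 if r == h else 1
                curr.append(min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                ))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words
```

Because edits and reference words are summed across the whole dataset before dividing, long utterances weigh more than short ones; averaging per-utterance WER would generally give a different number.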
 *Result (WER)*: