valhalla committed on
Commit 762d8d7 · 1 Parent(s): 69548f4
README.md ADDED
@@ -0,0 +1,150 @@
+ ---
+ language: en
+ datasets:
+ - librispeech_asr
+ tags:
+ - audio
+ - automatic-speech-recognition
+ license: MIT
+ ---
+
+
+ # S2T-MEDIUM-LIBRISPEECH-ASR
+
+ `s2t-medium-librispeech-asr` is a Speech to Text Transformer (S2T) model trained for automatic speech recognition (ASR).
+ The S2T model was proposed in [this paper](https://arxiv.org/abs/2010.05171) and released in
+ [this repository](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text).
+
+
+ ## Model description
+
+ S2T is an end-to-end sequence-to-sequence transformer model. It is trained with standard
+ autoregressive cross-entropy loss and generates the transcripts autoregressively.
+
+ ## Intended uses & limitations
+
+ This model can be used for end-to-end automatic speech recognition (ASR).
+ See the [model hub](https://huggingface.co/models?filter=speech_to_text_transformer) to look for other S2T checkpoints.
+
+
+ ### How to use
+
+ As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate the
+ transcripts by passing the speech features to the model.
+
+ *Note: The `Speech2TextProcessor` object uses [torchaudio](https://github.com/pytorch/audio) to extract the
+ filter bank features. Make sure to install the `torchaudio` package before running this example.*
+
+ To install `torchaudio`, run `pip install torchaudio`.
+
+
+ ```python
+ import torch
+ from transformers import Speech2TextProcessor, Speech2TextTransformerForConditionalGeneration
+ from datasets import load_dataset
+ import soundfile as sf
+
+ model = Speech2TextTransformerForConditionalGeneration.from_pretrained("facebook/s2t-medium-librispeech-asr")
+ processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-librispeech-asr")
+
+ def map_to_array(batch):
+     # Read the raw waveform from the audio file on disk.
+     speech, _ = sf.read(batch["file"])
+     batch["speech"] = speech
+     return batch
+
+ ds = load_dataset(
+     "patrickvonplaten/librispeech_asr_dummy",
+     "clean",
+     split="validation"
+ )
+ ds = ds.map(map_to_array)
+
+ # Extract filter bank features for the first utterance.
+ input_features = processor(
+     ds["speech"][0],
+     sampling_rate=16_000,
+     return_tensors="pt"
+ ).input_features  # Batch size 1
+ generated_ids = model.generate(input_ids=input_features)
+
+ transcription = processor.batch_decode(generated_ids)
+ ```
+
+ #### Evaluation on LibriSpeech Test
+
+ The following script shows how to evaluate this model on the [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)
+ *"clean"* and *"other"* test sets.
+
+ ```python
+ from datasets import load_dataset, load_metric
+ from transformers import Speech2TextTransformerForConditionalGeneration, Speech2TextProcessor
+ import soundfile as sf
+
+ librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")  # change to "other" for other test dataset
+ wer = load_metric("wer")
+
+ model = Speech2TextTransformerForConditionalGeneration.from_pretrained("facebook/s2t-medium-librispeech-asr").to("cuda")
+ processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-librispeech-asr", do_upper_case=True)
+
+ def map_to_array(batch):
+     # Read the raw waveform from the audio file on disk.
+     speech, _ = sf.read(batch["file"])
+     batch["speech"] = speech
+     return batch
+
+ librispeech_eval = librispeech_eval.map(map_to_array)
+
+ def map_to_pred(batch):
+     # Pad the batch to a common length and move the tensors to the GPU.
+     features = processor(batch["speech"], sampling_rate=16000, padding=True, return_tensors="pt")
+     input_features = features.input_features.to("cuda")
+     attention_mask = features.attention_mask.to("cuda")
+
+     gen_tokens = model.generate(input_ids=input_features, attention_mask=attention_mask)
+     batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)
+     return batch
+
+ result = librispeech_eval.map(map_to_pred, batched=True, batch_size=8, remove_columns=["speech"])
+
+ # The metric object is not callable; use its `compute` method.
+ print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))
+ ```
+
+ *Result (WER, %)*:
+
+ | "clean" | "other" |
+ |:-------:|:-------:|
+ | 3.5 | 7.8 |
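
For readers unfamiliar with the metric, word error rate is the word-level edit distance between prediction and reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, not the implementation behind `load_metric("wer")`. The function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER as a fraction: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping one row at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                ))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# Made-up sentences, for illustration only.
refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello there world"]
print(f"WER: {100 * word_error_rate(refs, hyps):.1f}%")
```

The sketch returns a fraction, so it is multiplied by 100 to get a percentage comparable to the table above.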
+
+
+
+ ## Training data
+
+ S2T-MEDIUM-LIBRISPEECH-ASR is trained on the [LibriSpeech ASR Corpus](https://www.openslr.org/12), a dataset consisting of
+ approximately 1000 hours of 16 kHz read English speech.
+
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ The speech data is pre-processed by automatically extracting Kaldi-compliant 80-channel log mel-filter bank features from
+ WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization)
+ is then applied to each example.
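
To make the normalization step concrete, here is a minimal pure-Python sketch of utterance-level CMVN. It is an illustration under stated assumptions, not the code used by fairseq or `Speech2TextProcessor`: the function name `utterance_cmvn` is hypothetical, the `norm_means` and `norm_vars` flags are named after the keys in `preprocessor_config.json` from this commit, and the use of population variance with a small `eps` guard is an assumption.

```python
import math


def utterance_cmvn(features, norm_means=True, norm_vars=True, eps=1e-10):
    """Normalize each feature channel over the time axis of one utterance.

    `features` is a list of frames; each frame is a list of channel values
    (for this model, 80 log mel-filter bank values per frame).
    """
    num_frames = len(features)
    num_channels = len(features[0])
    out = [list(frame) for frame in features]
    for c in range(num_channels):
        column = [frame[c] for frame in features]
        mean = sum(column) / num_frames
        # Population variance over the frames of this utterance (assumption).
        var = sum((x - mean) ** 2 for x in column) / num_frames
        std = math.sqrt(var)
        for t in range(num_frames):
            value = out[t][c]
            if norm_means:
                value -= mean
            if norm_vars:
                value /= max(std, eps)  # eps avoids division by zero (assumption)
            out[t][c] = value
    return out


# Toy input: 3 frames with 2 channels instead of 80.
feats = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = utterance_cmvn(feats)
```

After normalization, each channel has approximately zero mean and unit variance within the utterance, regardless of its original scale.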
+
+ The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 10,000.
+
+
+ ### Training
+
+ The model is trained with standard autoregressive cross-entropy loss and [SpecAugment](https://arxiv.org/abs/1904.08779).
+ The encoder receives speech features, and the decoder generates the transcripts autoregressively.
+
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @inproceedings{wang2020fairseqs2t,
+   title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
+   author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
+   booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
+   year = {2020},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,45 @@
+ {
+   "activation_dropout": 0.15,
+   "activation_function": "relu",
+   "architectures": [
+     "Speech2TextTransformerForConditionalGeneration"
+   ],
+   "attention_dropout": 0.15,
+   "bos_token_id": 0,
+   "classifier_dropout": 0.0,
+   "conv_channels": 1024,
+   "conv_kernel_sizes": [
+     5,
+     5
+   ],
+   "d_model": 512,
+   "decoder_attention_heads": 8,
+   "decoder_ffn_dim": 2048,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 6,
+   "decoder_start_token_id": 2,
+   "dropout": 0.15,
+   "early_stopping": true,
+   "encoder_attention_heads": 8,
+   "encoder_ffn_dim": 2048,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 12,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "init_std": 0.02,
+   "input_channels": 1,
+   "input_feat_per_channel": 80,
+   "is_encoder_decoder": true,
+   "max_length": 200,
+   "max_source_positions": 6000,
+   "max_target_positions": 1024,
+   "model_type": "speech_to_text_transformer",
+   "num_beams": 5,
+   "num_conv_layers": 2,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "scale_embedding": true,
+   "transformers_version": "4.4.0.dev0",
+   "use_cache": true,
+   "vocab_size": 10000
+ }
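
The config above lists two convolutional layers with kernel size 5 in front of the encoder. As a rough sketch of how such a front end shortens the input sequence, the helper below computes the number of encoder positions for a given number of feature frames. The stride of 2 and the padding of `kernel_size // 2` are assumptions that do not appear in `config.json`, and `conv_output_length` is a hypothetical name, not a library function.

```python
def conv_output_length(num_frames, kernel_sizes=(5, 5), stride=2):
    """Sequence length after a stack of strided 1-D convolutions.

    Assumes each layer uses padding = kernel_size // 2 and the given stride;
    neither value is stated in config.json.
    """
    length = num_frames
    for kernel_size in kernel_sizes:
        padding = kernel_size // 2
        # Standard output-length formula for a 1-D convolution.
        length = (length + 2 * padding - kernel_size) // stride + 1
    return length


# Under these assumptions, 1000 feature frames shrink by roughly a factor of 4.
print(conv_output_length(1000))
```

If those assumptions hold, the encoder attends over about a quarter as many positions as there are input frames, which keeps self-attention affordable for long utterances.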
preprocessor_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "do_normalize": true,
+   "feature_size": 80,
+   "norm_means": true,
+   "norm_vars": true,
+   "num_mel_bins": 80,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": true,
+   "sampling_rate": 16000
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8b016d9dde06a7d3f73855d19ef597b61055a5ef6d9ec0d12132c7f4077e2aea
+ size 284968270
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:052a168787a9160b4b2ba54e4995e9600298812c34191ca3f70cea51cd4f5c1e
+ size 416684
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "do_upper_case": false, "do_lower_case": true, "tgt_lang": null, "lang_codes": null, "special_tokens_map_file": "/home/suraj/.cache/huggingface/transformers/f39f1499e9c4d2b3e803e3cad8a31c4cf3e626e1c69197d4cd6921e5c07007f9.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd", "tokenizer_file": null, "name_or_path": "hf_models_fb/s2t-small-librispeech-asr/"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff