Commit 9883158 · 1 Parent(s): fc8b3e0
Update README.md

README.md CHANGED
@@ -12,4 +12,113 @@ pipeline_tag: automatic-speech-recognition
| 12 |
license: apache-2.0
|
| 13 |
---
|
| 14 |
|
| 15 |
-
# Wav2Vec2-XLS-R-2B-22-16
| 12 |
license: apache-2.0
|
| 13 |
---
|
| 14 |
|
| 15 |
+
# Wav2Vec2-XLS-R-2B-22-16
|
| 16 |
+
|
| 17 |
+
Facebook's Wav2Vec2 XLS-R fine-tuned for **Speech Translation.**
|
| 18 |
+
|
| 19 |
+

|
| 20 |
+
|
| 21 |
+
This is a [SpeechEncoderDecoderModel](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html) model.
The encoder was warm-started from the [**`facebook/wav2vec2-xls-r-2b`**](https://huggingface.co/facebook/wav2vec2-xls-r-2b) checkpoint and
the decoder from the [**`facebook/mbart-large-50`**](https://huggingface.co/facebook/mbart-large-50) checkpoint.
Subsequently, the encoder-decoder model was fine-tuned on `{input_lang}` -> `{output_lang}` translation pairs
of the [CoVoST 2 dataset](https://huggingface.co/datasets/covost2).
|
| 26 |
+
|
| 27 |
+
The model can translate from the following spoken languages `{input_lang}` to the following written languages `{output_lang}`:
|
| 28 |
+
|
| 29 |
+
`{input_lang}` -> `{output_lang}`
|
| 30 |
+
|
| 31 |
+
with `{input_lang}` one of:
|
| 32 |
+
|
| 33 |
+
{`en`, `fr`, `de`, `es`, `ca`, `it`, `ru`, `zh-CN`, `pt`, `fa`, `et`, `mn`, `nl`, `tr`, `ar`, `sv-SE`, `lv`, `sl`, `ta`, `ja`, `id`, `cy`}
|
| 34 |
+
|
| 35 |
+
and `{output_lang}`:
|
| 36 |
+
|
| 37 |
+
{`en`, `de`, `tr`, `fa`, `sv-SE`, `mn`, `zh-CN`, `cy`, `ca`, `sl`, `et`, `id`, `ar`, `ta`, `lv`, `ja`}
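
The two lists above can be turned into a quick sanity check before running the model. The following is a minimal sketch: the names `INPUT_LANGS`, `OUTPUT_LANGS` and `is_listed_pair` are hypothetical helpers, not part of `transformers`, and the check only tests membership in the lists. It says nothing about translation quality or about which combinations were seen during fine-tuning.

```python
# Language codes copied from the two lists above.
INPUT_LANGS = {
    "en", "fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et",
    "mn", "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy",
}
OUTPUT_LANGS = {
    "en", "de", "tr", "fa", "sv-SE", "mn", "zh-CN", "cy",
    "ca", "sl", "et", "id", "ar", "ta", "lv", "ja",
}


def is_listed_pair(input_lang: str, output_lang: str) -> bool:
    """Return True if both codes appear in the lists above (hypothetical helper)."""
    return input_lang in INPUT_LANGS and output_lang in OUTPUT_LANGS


print(len(INPUT_LANGS), len(OUTPUT_LANGS))  # 22 16
print(is_listed_pair("fr", "de"))  # True: `fr` is an input, `de` an output
print(is_listed_pair("en", "fr"))  # False: `fr` is not a listed output
```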
|
| 38 |
+
|
| 39 |
+
## Usage
|
| 40 |
+
|
| 41 |
+
### Demo
|
| 42 |
+
|
| 43 |
+
The model can be tested on [this space](https://huggingface.co/spaces/facebook/XLS-R-2B-22-16).
You can select the target language, record some audio in any of the above-mentioned input languages,
and then sit back and see how well the checkpoint can translate the input.
|
| 46 |
+
|
| 47 |
+
### Example
|
| 48 |
+
|
| 49 |
+
As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate the
translations by passing the speech features to the model.

You can use the model directly via the ASR pipeline. By default, the checkpoint will
translate spoken English to written German. To change the written target language,
you need to pass the correct `forced_bos_token_id` to `generate(...)` to condition
the decoder on the correct target language.
|
| 56 |
+
|
| 57 |
+
To select the correct `forced_bos_token_id` for your chosen language ID, please make use
of the following mapping:
|
| 59 |
+
|
| 60 |
+
```python
MAPPING = {
    "en": 250004,
    "de": 250003,
    "tr": 250023,
    "fa": 250029,
    "sv": 250042,
    "mn": 250037,
    "zh": 250025,
    "cy": 250007,
    "ca": 250005,
    "sl": 250052,
    "et": 250006,
    "id": 250032,
    "ar": 250001,
    "ta": 250044,
    "lv": 250017,
    "ja": 250012,
}
```
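
Note that the mapping is keyed by two-letter codes (`sv`, `zh`), while the language lists further up use regional codes (`sv-SE`, `zh-CN`). A minimal sketch of a lookup that accepts both forms is shown below. `forced_bos_token_id_for` and `EXAMPLE_MAPPING` are hypothetical names introduced for illustration, and the assumption that dropping the region suffix gives the right key is based only on the codes shown in this README.

```python
# Excerpt of the mapping above, used here only to keep the sketch short;
# in practice, pass the full `MAPPING` dictionary instead.
EXAMPLE_MAPPING = {"en": 250004, "de": 250003, "sv": 250042, "zh": 250025}


def forced_bos_token_id_for(lang_code, mapping):
    """Look up the decoder start token for a target language (hypothetical helper).

    Accepts both plain codes ("sv") and regional codes ("sv-SE") by keeping
    only the part before the first hyphen.
    """
    key = lang_code.split("-")[0].lower()
    if key not in mapping:
        raise ValueError(f"No forced_bos_token_id known for target language {lang_code!r}")
    return mapping[key]


print(forced_bos_token_id_for("sv-SE", EXAMPLE_MAPPING))  # 250042
print(forced_bos_token_id_for("zh-CN", EXAMPLE_MAPPING))  # 250025
```

Raising an error for unknown codes, rather than silently using the default, avoids accidentally producing German output when a target language is misspelled.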
|
| 80 |
+
|
| 81 |
+
As an example, if you would like to translate to Swedish, you can do the following:
|
| 82 |
+
|
| 83 |
+
```python
from datasets import load_dataset
from transformers import pipeline

# select correct `forced_bos_token_id`
forced_bos_token_id = MAPPING["sv"]

# replace following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-2b-22-to-16", feature_extractor="facebook/wav2vec2-xls-r-2b-22-to-16")

translation = asr(audio_file, forced_bos_token_id=forced_bos_token_id)
```
|
| 98 |
+
|
| 99 |
+
or step-by-step as follows:
|
| 100 |
+
|
| 101 |
+
```python
import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
from datasets import load_dataset

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# select correct `forced_bos_token_id` (`MAPPING` is the dictionary defined above)
forced_bos_token_id = MAPPING["sv"]

# the sampling rate is stored next to the waveform array in the `audio` column
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
# pass the processor outputs to `generate` and condition the decoder on the target language
generated_ids = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id)
translation = processor.batch_decode(generated_ids)
```
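
Both examples above read audio from a dummy dataset. When substituting your own recording, it is worth checking the sampling rate first. The sketch below uses only Python's standard `wave` module, so it handles PCM WAV files only. It assumes a 16 kHz target rate, which is what XLS-R checkpoints are generally trained on but is not stated in this README. `wav_sampling_rate` and `needs_resampling` are hypothetical helper names.

```python
import wave

# Assumption: the model expects 16 kHz input, as is usual for XLS-R checkpoints.
EXPECTED_RATE = 16_000


def wav_sampling_rate(path):
    """Return the sampling rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sampling rate differs from the expected one."""
    return wav_sampling_rate(path) != expected_rate
```

For example, `needs_resampling("my_recording.wav")` (a placeholder path) returns `True` for a 44.1 kHz recording, in which case the audio should be resampled before it is passed to the pipeline or the processor.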
|
| 118 |
+
|
| 119 |
+
## More XLS-R models for `en` -> `{lang}` Speech Translation
|
| 120 |
+
|
| 121 |
+
- [Wav2Vec2-XLS-R-300M-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-300m-en-to-15)
|
| 122 |
+
- [Wav2Vec2-XLS-R-1B-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-1b-en-to-15)
|
| 123 |
+
- [Wav2Vec2-XLS-R-2B-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-2b-en-to-15)
|
| 124 |
+
- [Wav2Vec2-XLS-R-2B-22-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16)
|