---
license: apache-2.0
tags:
- automatic-speech-recognition
- smi
- sami
library_name: transformers
language: fi
base_model:
- GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned
model-index:
- name: wav2vec2-base-sami-cont-pt-22k-finetuned
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sami-1h-test
      type: sami-1h-test
      args: fi
    metrics:
    - name: Test WER
      type: wer
      value: 43.04
    - name: Test CER
      type: cer
      value: 15.76
---

# Sámi Wav2vec2-Base ASR

[GetmanY1/wav2vec2-base-sami-cont-pt-22k](https://huggingface.co/GetmanY1/wav2vec2-base-sami-cont-pt-22k) fine-tuned on 20 hours of speech audio, sampled at 16 kHz, from the [Sámi Parliament sessions](https://sametinget.kommunetv.no/archive).

When using the model, make sure that your speech input is also sampled at 16 kHz.
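A quick way to catch a sampling-rate mismatch before running the model is to read the rate from the WAV header. The sketch below uses only the Python standard library; the helper names (`wav_sample_rate`, `needs_resampling`, `write_silence`) and the demo files are hypothetical and not part of this model's code. It only detects a mismatch, and the resampling itself is left to whichever audio library you already use.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True if the file must be resampled before it is passed to the model."""
    return wav_sample_rate(path) != target_rate


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file, used here only as demo input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo on two temporary files with different sampling rates.
with tempfile.TemporaryDirectory() as tmp_dir:
    ok_path = os.path.join(tmp_dir, "ok.wav")
    bad_path = os.path.join(tmp_dir, "bad.wav")
    write_silence(ok_path, 16_000)
    write_silence(bad_path, 44_100)
    ok_needs = needs_resampling(ok_path)
    bad_needs = needs_resampling(bad_path)
    print(ok_needs, bad_needs)  # prints: False True
```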

## Model description

The Sámi Wav2Vec2 Base has the same architecture and uses the same training objective as the English model described in the [wav2vec 2.0 paper](https://arxiv.org/abs/2006.11477).

[GetmanY1/wav2vec2-base-sami-cont-pt-22k](https://huggingface.co/GetmanY1/wav2vec2-base-sami-cont-pt-22k) is a large-scale, 95-million-parameter monolingual model pre-trained on 22.4k hours of unlabeled Sámi speech from [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/).
You can read more about the pre-trained model in [this paper](TODO).

The model was evaluated on 1 hour of out-of-domain read-aloud and spontaneous speech of varying audio quality.
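For readers unfamiliar with the reported metrics, here is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed from an edit distance. It assumes the standard definitions (edits divided by reference length, in percent). The exact evaluation script and text normalization used for this model are not stated in this card, and the example strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of items."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference item missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra item in hypothesis)
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (spaces count as characters here)."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(round(wer(reference, hypothesis), 2))  # prints: 33.33
print(round(cer(reference, hypothesis), 2))  # prints: 22.73
```

In this toy pair, one word is substituted and one is deleted (2 edits over 6 reference words), which is why the WER is higher than the CER.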

## Intended uses

You can use this model for Sámi ASR (speech-to-text).

### How to use

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, Audio
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")

# load your audio; the folder path is a placeholder, replace it with your own data
ds = load_dataset("audiofolder", data_dir="path/to/your/audio", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))  # decode and resample to 16 kHz

# tokenize
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
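To make the "take argmax and decode" step less opaque, the sketch below shows what greedy CTC decoding does conceptually: collapse consecutive repeated ids, drop the blank token, and turn the word delimiter into spaces. The toy vocabulary is hypothetical. The blank id `0` and the `|` delimiter are assumptions that mirror common wav2vec 2.0 CTC tokenizer settings, so check your tokenizer before relying on them.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame argmax ids into text using the CTC rules."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove the blank token, which only separates repeated characters.
    tokens = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Join characters and map the word delimiter to a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical four-token vocabulary and per-frame argmax ids.
toy_vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}
frame_ids = [0, 2, 2, 0, 2, 3, 3, 1, 1, 0, 3, 0]
decoded = greedy_ctc_decode(frame_ids, toy_vocab)
print(decoded)  # prints: aab b
```

The blank between the two runs of `a` is what keeps the double letter. Without it, the repeated ids would be merged into a single character.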

### Prefix Beam Search

In our experiments (see [paper](TODO)), we observed a slight improvement in Character Error Rate (CER) when using prefix beam search instead of greedy decoding, primarily due to fewer deletions. Below is our adapted version of [corticph/prefix-beam-search](https://github.com/corticph/prefix-beam-search) for use with wav2vec 2.0 in Hugging Face Transformers.
Note that an external language model (LM) **is not required**, as the function defaults to a uniform probability when none is provided.

```python
import re
from collections import Counter, defaultdict

import numpy as np
import torch

# `processor`, `model` and `ds` are the objects loaded in the "How to use" example above.

def prefix_beam_search(ctc, lm=None, k=25, alpha=0.30, beta=5, prune=0.001):
    """
    Performs prefix beam search on the output of a CTC network.

    Args:
        ctc (np.ndarray): The CTC output. Should be a 2D array (timesteps x alphabet_size)
        lm (func): Language model function. Should take as input a string and output a probability.
        k (int): The beam width. Will keep the 'k' most likely candidates at each timestep.
        alpha (float): The language model weight. Should usually be between 0 and 1.
        beta (float): The language model compensation term. The higher the 'alpha', the higher the 'beta'.
        prune (float): Only extend prefixes with chars with an emission probability higher than 'prune'.

    Returns:
        string: The decoded CTC output.
    """

    lm = (lambda l: 1) if lm is None else lm  # if no LM is provided, just set to function returning 1
    W = lambda l: re.findall(r'\w+[\s|>]', l)  # words completed so far (followed by a space or the end character)
    # tokens ordered by their index in the vocabulary
    alphabet = list({k: v for k, v in sorted(processor.tokenizer.vocab.items(), key=lambda item: item[1])})
    # map the special tokens to single characters: end of sequence -> '>', pad (CTC blank) -> '%', word delimiter -> ' '
    alphabet = list(map(lambda x: x.replace(processor.tokenizer.special_tokens_map['eos_token'], '>') \
                                   .replace(processor.tokenizer.special_tokens_map['pad_token'], '%') \
                                   .replace('|', ' '), alphabet))

    F = ctc.shape[1]
    ctc = np.vstack((np.zeros(F), ctc))  # just add an imaginary zero'th step (will make indexing more intuitive)
    T = ctc.shape[0]

    # STEP 1: Initialization
    O = ''
    Pb, Pnb = defaultdict(Counter), defaultdict(Counter)
    Pb[0][O] = 1
    Pnb[0][O] = 0
    A_prev = [O]
    # END: STEP 1

    # STEP 2: Iterations and pruning
    for t in range(1, T):
        pruned_alphabet = [alphabet[i] for i in np.where(ctc[t] > prune)[0]]
        for l in A_prev:
            if len(l) > 0 and l.endswith('>'):
                Pb[t][l] = Pb[t - 1][l]
                Pnb[t][l] = Pnb[t - 1][l]
                continue
            for c in pruned_alphabet:
                c_ix = alphabet.index(c)
                # END: STEP 2

                # STEP 3: "Extending" with a blank
                # (index 0 is used as the blank, i.e. the pad token is assumed to be at vocabulary index 0)
                if c == '%':
                    Pb[t][l] += ctc[t][0] * (Pb[t - 1][l] + Pnb[t - 1][l])
                # END: STEP 3

                # STEP 4: Extending with the end character
                else:
                    l_plus = l + c
                    if len(l) > 0 and l.endswith(c):
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pb[t - 1][l]
                        Pnb[t][l] += ctc[t][c_ix] * Pnb[t - 1][l]
                    # END: STEP 4

                    # STEP 5: Extending with any other non-blank character and LM constraints
                    elif len(l.replace(' ', '')) > 0 and c in (' ', '>'):
                        lm_prob = lm(l_plus.strip(' >')) ** alpha
                        Pnb[t][l_plus] += lm_prob * ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    else:
                        Pnb[t][l_plus] += ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    # END: STEP 5

                    # STEP 6: Make use of discarded prefixes
                    if l_plus not in A_prev:
                        Pb[t][l_plus] += ctc[t][0] * (Pb[t - 1][l_plus] + Pnb[t - 1][l_plus])
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pnb[t - 1][l_plus]
                    # END: STEP 6

        # STEP 7: Select most probable prefixes
        A_next = Pb[t] + Pnb[t]
        sorter = lambda l: A_next[l] * (len(W(l)) + 1) ** beta
        A_prev = sorted(A_next, key=sorter, reverse=True)[:k]
        # END: STEP 7

    return A_prev[0].strip('>')

def map_to_pred_prefix_beam_search(batch):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)  # keep the model on the same device as the inputs
    input_values = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    probs = torch.softmax(logits, dim=-1)
    transcription = [prefix_beam_search(probs[0].cpu().numpy(), lm=None)]
    batch["transcription"] = transcription
    return batch

# `ds` is expected to have a "speech" column holding the raw 16 kHz waveform of each utterance
result = ds.map(map_to_pred_prefix_beam_search, batched=True, batch_size=1, remove_columns=["speech"])
```
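If you do want to plug in a language model, the `lm` argument only needs to be a callable that takes the prefix string and returns a probability. The sketch below builds a toy add-one-smoothed unigram model over the last word of the prefix. It is a hypothetical illustration of the expected interface, not the language model used in our experiments, and the names `make_unigram_lm` and `toy_lm` are made up for this example.

```python
from collections import Counter


def make_unigram_lm(corpus_sentences):
    """Build a callable with the interface `lm(prefix: str) -> probability`."""
    counts = Counter(word for sentence in corpus_sentences for word in sentence.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def lm(prefix):
        words = prefix.split()
        if not words:
            return 1.0  # nothing to score yet, behave like the uniform default
        last_word = words[-1]
        # Add-one smoothing keeps unseen words at a small non-zero probability.
        return (counts[last_word] + 1) / (total + vocab_size)

    return lm


# Toy corpus; in practice this would be text in the target language.
toy_lm = make_unigram_lm(["aa bb aa", "bb aa"])
print(toy_lm("bb aa"))  # prints: 0.5
print(toy_lm("aa zz"))  # prints: 0.125
```

Such a function could then be passed as `prefix_beam_search(probs, lm=toy_lm)`. Inside the search, its output is raised to the power `alpha`, so a smaller `alpha` weakens the influence of the language model.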

## Team Members

- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/)
- Tamas Grosz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/)

Feel free to contact us for more details 🤗