---
license: apache-2.0
tags:
- automatic-speech-recognition
- smi
- sami
library_name: transformers
language: fi
base_model:
- GetmanY1/wav2vec2-base-sami-cont-pt-22k
model-index:
- name: wav2vec2-base-sami-cont-pt-22k-finetuned
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sami-1h-test
      type: sami-1h-test
      args: fi
    metrics:
    - name: Test WER
      type: wer
      value: 43.04
    - name: Test CER
      type: cer
      value: 15.76
---

# Sámi Wav2vec2-Base ASR

[GetmanY1/wav2vec2-base-sami-cont-pt-22k](https://huggingface.co/GetmanY1/wav2vec2-base-sami-cont-pt-22k) fine-tuned on 20 hours of 16 kHz sampled speech audio from the [Sámi Parliament sessions](https://sametinget.kommunetv.no/archive).

When using the model, make sure that your speech input is also sampled at 16 kHz.
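
If your recordings are stored at a different sampling rate, resample them to 16 kHz before passing them to the processor. Below is a minimal sketch using torchaudio; the file name `example.wav` is a placeholder for your own audio:

```python
import torchaudio

# Load a recording (placeholder path; replace with your own file)
waveform, sample_rate = torchaudio.load("example.wav")

# The model expects 16 kHz input, so resample if necessary
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)

# Convert to a mono 1-D float array, as expected by Wav2Vec2Processor
speech = waveform.mean(dim=0).numpy()
```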

## Model description

The Sámi Wav2Vec2 Base model has the same architecture and uses the same training objective as the English model described in the [original paper](https://arxiv.org/abs/2006.11477).

[GetmanY1/wav2vec2-base-sami-cont-pt-22k](https://huggingface.co/GetmanY1/wav2vec2-base-sami-cont-pt-22k) is a large-scale, 95-million-parameter monolingual model pre-trained on 22.4k hours of unlabeled Sámi speech from [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/).
You can read more about the pre-trained model in [this paper](TODO).

The model was evaluated on 1 hour of out-of-domain read-aloud and spontaneous speech of varying audio quality.
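
The WER and CER figures reported in the model index can be computed with standard metric implementations. A minimal sketch using the `evaluate` library; the reference and prediction lists below are placeholders for your own test data and model output:

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder transcripts; in practice these come from your test set and the model output
references = ["your reference transcript"]
predictions = ["your predicted transcript"]

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}, CER: {cer:.2f}")
```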

## Intended uses

You can use this model for Sámi ASR (speech-to-text).

### How to use

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")

# load a dataset with 16 kHz audio into `ds`, e.g. with load_dataset(...)

# tokenize
input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
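
To transcribe a whole evaluation set, the same greedy decoding can be wrapped in a map function, mirroring the prefix-beam-search variant shown further below. This is a sketch that assumes `ds` has a `speech` column containing 16 kHz audio arrays and that `map_to_pred_greedy` is a hypothetical helper name:

```python
def map_to_pred_greedy(batch):
    # greedy (argmax) decoding over one batch of 16 kHz audio arrays
    input_values = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch["transcription"] = processor.batch_decode(predicted_ids)
    return batch

result = ds.map(map_to_pred_greedy, batched=True, batch_size=1, remove_columns=["speech"])
```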

### Prefix Beam Search

In our experiments (see the [paper](TODO)), we observed a slight improvement in Character Error Rate (CER) when using prefix beam search compared to greedy decoding, primarily due to a reduction in deletions. Below is our adapted version of [corticph/prefix-beam-search](https://github.com/corticph/prefix-beam-search) for use with wav2vec 2.0 in Hugging Face Transformers.
Note that an external language model (LM) **is not required**, as the function defaults to a uniform probability when none is provided.

```python
import re
import numpy as np
from collections import defaultdict, Counter

def prefix_beam_search(ctc, lm=None, k=25, alpha=0.30, beta=5, prune=0.001):
    """
    Performs prefix beam search on the output of a CTC network.

    Args:
        ctc (np.ndarray): The CTC output. Should be a 2D array (timesteps x alphabet_size).
        lm (func): Language model function. Should take as input a string and output a probability.
        k (int): The beam width. Will keep the 'k' most likely candidates at each timestep.
        alpha (float): The language model weight. Should usually be between 0 and 1.
        beta (float): The language model compensation term. The higher the 'alpha', the higher the 'beta'.
        prune (float): Only extend prefixes with chars with an emission probability higher than 'prune'.

    Returns:
        string: The decoded CTC output.
    """

    lm = (lambda l: 1) if lm is None else lm  # if no LM is provided, just set to function returning 1
    W = lambda l: re.findall(r'\w+[\s|>]', l)
    alphabet = list({k: v for k, v in sorted(processor.tokenizer.vocab.items(), key=lambda item: item[1])})
    alphabet = list(map(lambda x: x.replace(processor.tokenizer.special_tokens_map['eos_token'], '>') \
                        .replace(processor.tokenizer.special_tokens_map['pad_token'], '%') \
                        .replace('|', ' '), alphabet))

    F = ctc.shape[1]
    ctc = np.vstack((np.zeros(F), ctc))  # just add an imaginary zeroth step (makes indexing more intuitive)
    T = ctc.shape[0]

    # STEP 1: Initialization
    O = ''
    Pb, Pnb = defaultdict(Counter), defaultdict(Counter)
    Pb[0][O] = 1
    Pnb[0][O] = 0
    A_prev = [O]
    # END: STEP 1

    # STEP 2: Iterations and pruning
    for t in range(1, T):
        pruned_alphabet = [alphabet[i] for i in np.where(ctc[t] > prune)[0]]
        for l in A_prev:

            if len(l) > 0 and l.endswith('>'):
                Pb[t][l] = Pb[t - 1][l]
                Pnb[t][l] = Pnb[t - 1][l]
                continue

            for c in pruned_alphabet:
                c_ix = alphabet.index(c)
                # END: STEP 2

                # STEP 3: "Extending" with a blank
                if c == '%':
                    Pb[t][l] += ctc[t][0] * (Pb[t - 1][l] + Pnb[t - 1][l])
                # END: STEP 3

                # STEP 4: Extending with the end character
                else:
                    l_plus = l + c
                    if len(l) > 0 and l.endswith(c):
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pb[t - 1][l]
                        Pnb[t][l] += ctc[t][c_ix] * Pnb[t - 1][l]
                    # END: STEP 4

                    # STEP 5: Extending with any other non-blank character and LM constraints
                    elif len(l.replace(' ', '')) > 0 and c in (' ', '>'):
                        lm_prob = lm(l_plus.strip(' >')) ** alpha
                        Pnb[t][l_plus] += lm_prob * ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    else:
                        Pnb[t][l_plus] += ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    # END: STEP 5

                    # STEP 6: Make use of discarded prefixes
                    if l_plus not in A_prev:
                        Pb[t][l_plus] += ctc[t][0] * (Pb[t - 1][l_plus] + Pnb[t - 1][l_plus])
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pnb[t - 1][l_plus]
                    # END: STEP 6

        # STEP 7: Select most probable prefixes
        A_next = Pb[t] + Pnb[t]
        sorter = lambda l: A_next[l] * (len(W(l)) + 1) ** beta
        A_prev = sorted(A_next, key=sorter, reverse=True)[:k]
        # END: STEP 7

    return A_prev[0].strip('>')

def map_to_pred_prefix_beam_search(batch):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)  # make sure the model is on the same device as the inputs
    input_values = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    probs = torch.softmax(logits, dim=-1)
    transcription = [prefix_beam_search(probs[0].cpu().numpy(), lm=None)]
    batch["transcription"] = transcription
    return batch

result = ds.map(map_to_pred_prefix_beam_search, batched=True, batch_size=1, remove_columns=["speech"])
```
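
The `lm` argument can be any callable that maps a candidate transcript (a string) to a probability. As a purely illustrative sketch, here is a toy unigram word model with the expected interface; in practice you would wrap a real language model trained on Sámi text in a function of the same shape (`toy_lm` and the sample corpus are hypothetical):

```python
from collections import Counter

# Toy unigram counts; purely illustrative, not a real Sámi language model
corpus = "replace this with your own sami language modelling text".split()
counts = Counter(corpus)
total = sum(counts.values())

def toy_lm(prefix):
    # Crude probability of the last word in the prefix, with a uniform fallback
    words = prefix.strip().split()
    if not words:
        return 1.0
    return counts.get(words[-1], 1) / total

# Drop-in replacement for the uniform default:
# transcription = [prefix_beam_search(probs[0].cpu().numpy(), lm=toy_lm)]
```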

## Team Members

- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/)
- Tamás Grósz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/)

Feel free to contact us for more details 🤗