+---
+license: apache-2.0
+tags:
+- automatic-speech-recognition
+- smi
+- sami
+library_name: transformers
+language: fi
+base_model:
+- GetmanY1/wav2vec2-base-sami-cont-pt-22k
+model-index:
+  - name: wav2vec2-base-sami-cont-pt-22k-finetuned
+    results:
+      - task:
+          name: Automatic Speech Recognition
+          type: automatic-speech-recognition
+        dataset:
+          name: Sami-1h-test
+          type: sami-1h-test
+          args: fi
+        metrics:
+          - name: Test WER
+            type: wer
+            value: 43.04
+          - name: Test CER
+            type: cer
+            value: 15.76
+---
+# Sámi Wav2vec2-Base ASR
+[GetmanY1/wav2vec2-base-sami-cont-pt-22k](https://huggingface.co/GetmanY1/wav2vec2-base-sami-cont-pt-22k) fine-tuned on 20 hours of speech audio, sampled at 16 kHz, from the [Sámi Parliament sessions](https://sametinget.kommunetv.no/archive).
+When using the model, make sure that your speech input is also sampled at 16 kHz.
+## Model description
+The Sámi Wav2Vec2 Base has the same architecture and uses the same training objective as the English model described in the [wav2vec 2.0 paper](https://arxiv.org/abs/2006.11477).
+[GetmanY1/wav2vec2-base-sami-cont-pt-22k](https://huggingface.co/GetmanY1/wav2vec2-base-sami-cont-pt-22k) is a large-scale, 95-million-parameter monolingual model pre-trained on 22.4k hours of unlabeled Sámi speech from [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/).
+You can read more about the pre-trained model in [this paper](TODO).
+The model was evaluated on 1 hour of out-of-domain read-aloud and spontaneous speech of varying audio quality.
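The WER and CER scores reported in the metadata are edit-distance ratios: the number of word-level (or character-level) substitutions, insertions and deletions divided by the reference length. The block below is a minimal pure-Python sketch of that calculation, not the evaluation script used for this model. The helper names `edit_distance`, `wer` and `cer`, as well as the sample strings, are hypothetical and chosen only for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference item missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra item in hypothesis
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (spaces count as characters here)."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# placeholder strings, not taken from the test set
reference = "one two three four"
hypothesis = "one too three"
print(f"WER: {wer(reference, hypothesis):.2f}")  # 1 substitution + 1 deletion over 4 words
print(f"CER: {cer(reference, hypothesis):.2f}")  # 1 substitution + 5 deletions over 18 characters
```

Text normalization (casing, punctuation, whitespace) affects both scores, so any comparison with the reported numbers would need to apply the same normalization as the original evaluation.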
+## Intended uses
+You can use this model for Sámi ASR (speech-to-text).
+### How to use
+To transcribe audio files, the model can be used as a standalone acoustic model as follows:
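+```
+from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
+from datasets import load_dataset, Audio
+import torch
+# load model and processor
+processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")
+model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")
+# load audio; "path/to/audio" is a placeholder for a local folder of audio files
+ds = load_dataset("audiofolder", data_dir="path/to/audio", split="train")
+# decode and resample every file to the 16 kHz rate the model expects
+ds = ds.cast_column("audio", Audio(sampling_rate=16000))
+# extract input features from the raw waveform
+input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values  # Batch size 1
+# retrieve logits (no gradients are needed for inference)
+with torch.no_grad():
+    logits = model(input_values).logits
+# take argmax and decode
+predicted_ids = torch.argmax(logits, dim=-1)
+transcription = processor.batch_decode(predicted_ids)
+```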
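The argmax-and-decode step above is greedy CTC decoding: take the most likely token for every frame, merge consecutive repeats, then drop the blank token. The following is a minimal pure-Python sketch of that collapse rule, assuming a hypothetical four-token vocabulary in which `<pad>` acts as the CTC blank and `|` as the word delimiter. The function name `greedy_ctc_collapse` and the toy ids are illustrative only; in practice `processor.batch_decode` handles this with the model's own vocabulary.

```python
def greedy_ctc_collapse(frame_ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Turn per-frame argmax ids into text: merge repeats, drop blanks, map the delimiter to a space."""
    tokens = []
    previous = None
    for idx in frame_ids:
        if idx != previous:        # consecutive repeats of the same id count once
            token = id_to_token[idx]
            if token != blank:     # a blank emits nothing; it only separates repeated tokens
                tokens.append(token)
        previous = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()


# hypothetical toy vocabulary and frame-level predictions, for illustration only
id_to_token = {0: "<pad>", 1: "|", 2: "a", 3: "b"}
frame_ids = [2, 2, 0, 2, 3, 3, 1, 1, 3, 0, 0, 2]
print(greedy_ctc_collapse(frame_ids, id_to_token))  # aab ba
```

Note how the blank between the two runs of `a` is what allows a doubled letter to survive the merge.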
+### Prefix Beam Search
+In our experiments (see [paper](TODO)), we observed a slight improvement in terms of Character Error Rate (CER) when using prefix beam search compared to greedy decoding, primarily due to a reduction in deletions. Below is our adapted version of [corticph/prefix-beam-search](https://github.com/corticph/prefix-beam-search) for use with wav2vec 2.0 in HuggingFace Transformers.
+Note that an external language model (LM) **is not required**, as the function defaults to a uniform probability when none is provided.
+```
+import re
+from collections import Counter, defaultdict
+import numpy as np
+import torch
+def prefix_beam_search(ctc, lm=None, k=25, alpha=0.30, beta=5, prune=0.001):
+	"""
+	Performs prefix beam search on the output of a CTC network.
+	Args:
+		ctc (np.ndarray): The CTC output. Should be a 2D array (timesteps x alphabet_size)
+		lm (func): Language model function. Should take as input a string and output a probability.
+		k (int): The beam width. Will keep the 'k' most likely candidates at each timestep.
+		alpha (float): The language model weight. Should usually be between 0 and 1.
+		beta (float): The language model compensation term. The higher the 'alpha', the higher the 'beta'.
+		prune (float): Only extend prefixes with chars with an emission probability higher than 'prune'.
+	Returns:
+		string: The decoded CTC output.
+	"""
+	lm = (lambda l: 1) if lm is None else lm # if no LM is provided, just set to function returning 1
+	W = lambda l: re.findall(r'\w+[\s|>]', l)
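+	# build the alphabet ordered by token id so that list positions match the columns of the CTC output
+	# (the ctc[t][0] lookups below assume that the blank/pad token has id 0)
+	vocab = processor.tokenizer.get_vocab()
+	alphabet = sorted(vocab, key=vocab.get)
+	# map special tokens to single characters: '>' = end of sequence, '%' = CTC blank (pad), ' ' = word delimiter
+	alphabet = list(map(lambda x: x.replace(processor.tokenizer.special_tokens_map['eos_token'], '>') \
+						.replace(processor.tokenizer.special_tokens_map['pad_token'], '%') \
+						.replace('|', ' '), alphabet))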
+	F = ctc.shape[1]
+	ctc = np.vstack((np.zeros(F), ctc)) # prepend an imaginary zeroth step (makes the indexing more intuitive)
+	T = ctc.shape[0]
+	# STEP 1: Initialization
+	O = ''
+	Pb, Pnb = defaultdict(Counter), defaultdict(Counter)
+	Pb[0][O] = 1
+	Pnb[0][O] = 0
+	A_prev = [O]
+	# END: STEP 1
+	# STEP 2: Iterations and pruning
+	for t in range(1, T):
+		pruned_alphabet = [alphabet[i] for i in np.where(ctc[t] > prune)[0]]
+		for l in A_prev:
+			if len(l) > 0 and l.endswith('>'):
+				Pb[t][l] = Pb[t - 1][l]
+				Pnb[t][l] = Pnb[t - 1][l]
+				continue
+			for c in pruned_alphabet:
+				c_ix = alphabet.index(c)
+				# END: STEP 2
+				# STEP 3: “Extending” with a blank
+				if c == '%':
+					Pb[t][l] += ctc[t][0] * (Pb[t - 1][l] + Pnb[t - 1][l])
+				# END: STEP 3
+				# STEP 4: Extending with the end character
+				else:
+					l_plus = l + c
+					if len(l) > 0 and l.endswith(c):
+						Pnb[t][l_plus] += ctc[t][c_ix] * Pb[t - 1][l]
+						Pnb[t][l] += ctc[t][c_ix] * Pnb[t - 1][l]
+				# END: STEP 4
+					# STEP 5: Extending with any other non-blank character and LM constraints
+					elif len(l.replace(' ', '')) > 0 and c in (' ', '>'):
+						lm_prob = lm(l_plus.strip(' >')) ** alpha
+						Pnb[t][l_plus] += lm_prob * ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
+					else:
+						Pnb[t][l_plus] += ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
+					# END: STEP 5
+					# STEP 6: Make use of discarded prefixes
+					if l_plus not in A_prev:
+						Pb[t][l_plus] += ctc[t][0] * (Pb[t - 1][l_plus] + Pnb[t - 1][l_plus])
+						Pnb[t][l_plus] += ctc[t][c_ix] * Pnb[t - 1][l_plus]
+					# END: STEP 6
+		# STEP 7: Select most probable prefixes
+		A_next = Pb[t] + Pnb[t]
+		sorter = lambda l: A_next[l] * (len(W(l)) + 1) ** beta
+		A_prev = sorted(A_next, key=sorter, reverse=True)[:k]
+		# END: STEP 7
+	return A_prev[0].strip('>')
+def map_to_pred_prefix_beam_search(batch):
+    # relies on the `processor` and `model` loaded earlier; `batch["speech"]` is expected to hold raw 16 kHz waveforms
+    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+    model.to(device)  # keep the model on the same device as its inputs
+    input_values = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
+    with torch.no_grad():
+        logits = model(input_values.to(device)).logits
+    probs = torch.softmax(logits, dim=-1)
+    # decode every utterance in the batch separately
+    batch["transcription"] = [prefix_beam_search(p.cpu().numpy(), lm=None) for p in probs]
+    return batch
+# `ds` must provide a "speech" column with the waveform of each utterance
+result = ds.map(map_to_pred_prefix_beam_search, batched=True, batch_size=1, remove_columns=["speech"])
+```
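The `lm` argument of `prefix_beam_search` accepts any callable that takes a string and returns a probability. As a rough illustration of that interface only, here is a hypothetical add-one-smoothed unigram model that scores the last word of a prefix. The factory name `make_unigram_lm` and the toy corpus are assumptions made for this sketch; it is not the language model used in our experiments, and a real setup would more likely wrap an n-gram or neural LM.

```python
from collections import Counter


def make_unigram_lm(corpus_sentences):
    """Build a toy LM callable: string prefix -> probability of its last word."""
    counts = Counter(word for sentence in corpus_sentences for word in sentence.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserves probability mass for unseen words

    def lm(prefix):
        words = prefix.split()
        if not words:
            return 1.0  # nothing to score yet
        # add-one smoothing keeps unseen words from getting zero probability
        return (counts[words[-1]] + 1) / (total + vocab_size)

    return lm


# hypothetical toy corpus, for illustration only
toy_lm = make_unigram_lm(["a b a", "a c"])
print(toy_lm("b a"))    # frequent last word -> higher probability
print(toy_lm("b zzz"))  # unseen last word -> small but non-zero probability
```

Such a callable could then be passed as `lm=toy_lm` instead of `lm=None`, with `alpha` and `beta` tuned on held-out data.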
+## Team Members
+- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/)
+- Tamas Grosz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/)
+Feel free to contact us for more details 🤗