pere committed on
Commit
167b691
1 Parent(s): c58b090

Update README.md

  1. README.md +288 -77
README.md CHANGED
@@ -1,87 +1,298 @@
1
  ---
 
2
  language:
3
  - 'no'
4
- license: apache-2.0
5
- base_model: NbAiLab/nb-whisper-medium-v0.8-vad3
6
  tags:
7
  - audio
8
  - asr
9
  - automatic-speech-recognition
10
  - hf-asr-leaderboard
11
- model-index:
12
- - name: nb-whisper-medium-v0.8-vad3-verbatim
13
- results: []
14
  ---
15
 
16
- <!-- This model card has been generated automatically according to the information Keras had access to. You should
17
- probably proofread and complete it, then remove this comment. -->
18
-
19
- # nb-whisper-medium-v0.8-vad3-verbatim
20
-
21
- This model is a fine-tuned version of [NbAiLab/nb-whisper-medium-v0.8-vad3](https://huggingface.co/NbAiLab/nb-whisper-medium-v0.8-vad3) on the NbAiLab/NPSC dataset.
22
- It achieves the following results on the evaluation set:
23
- - step: 249
24
- - validation_loss: 0.6296
25
- - train_loss: 0.4324
26
- - validation_wer: 8.2769
27
- - validation_cer: 2.8193
28
- - validation_exact_wer: 8.4048
29
- - validation_exact_cer: 2.8363
30
-
31
- ## Model description
32
-
33
- More information needed
34
-
35
- ## Intended uses & limitations
36
-
37
- More information needed
38
-
39
- ## Training and evaluation data
40
-
41
- More information needed
42
-
43
- ## Training procedure
44
-
45
- ### Training hyperparameters
46
-
47
- The following hyperparameters were used during training:
48
- - learning_rate: 2.5e-05
49
- - lr_scheduler_type: linear
50
- - per_device_train_batch_size: 32
51
- - total_train_batch_size_per_node: 128
52
- - total_train_batch_size: 1024
53
- - total_optimization_steps: 250
54
- - starting_optimization_step: None
55
- - finishing_optimization_step: 250
56
- - num_train_dataset_workers: 32
57
- - num_hosts: 8
58
- - total_num_training_examples: 256,000
59
- - steps_per_epoch: 45
60
- - num_beams: None
61
- - weight_decay: 0.01
62
- - adam_beta1: 0.9
63
- - adam_beta2: 0.98
64
- - adam_epsilon: 1e-06
65
- - dropout: True
66
- - bpe_dropout_probability: 0.2
67
- - activation_dropout_probability: 0.1
68
-
69
- ### Training results
70
-
71
- | step | validation_loss | train_loss | validation_wer | validation_cer | validation_exact_wer | validation_exact_cer |
72
- |:----:|:---------------:|:----------:|:--------------:|:--------------:|:--------------------:|:--------------------:|
73
- | 0 | 1.5895 | 1.4606 | 17.5605 | 10.5650 | 33.0099 | 13.8415 |
74
- | 40 | 0.6409 | 0.5035 | 9.1662 | 3.0250 | 9.3637 | 3.0542 |
75
- | 80 | 0.6309 | 0.4790 | 8.7132 | 2.9755 | 8.8730 | 2.9952 |
76
- | 120 | 0.6250 | 0.4480 | 8.4503 | 2.8812 | 8.6079 | 2.9019 |
77
- | 160 | 0.6294 | 0.4423 | 8.4000 | 2.8641 | 8.5345 | 2.8810 |
78
- | 200 | 0.6276 | 0.4467 | 8.3161 | 2.8345 | 8.4668 | 2.8534 |
79
- | 240 | 0.6287 | 0.4376 | 8.2266 | 2.7917 | 8.3597 | 2.8087 |
80
- | 249 | 0.6296 | 0.4324 | 8.2769 | 2.8193 | 8.4048 | 2.8363 |
81
-
82
-
83
- ### Framework versions
84
-
85
- - Transformers 4.34.1
86
- - Datasets 2.16.1
87
- - Tokenizers 0.14.1
 
1
  ---
2
+ license: apache-2.0
3
  language:
4
  - 'no'
5
+ - nb
6
+ - nn
7
+ - en
8
+ datasets:
9
+ - NbAiLab/ncc_speech
10
+ - NbAiLab/NST
11
+ - NbAiLab/NPSC
12
+ base_model: openai/whisper-medium
13
  tags:
14
  - audio
15
  - asr
16
  - automatic-speech-recognition
17
  - hf-asr-leaderboard
18
+ metrics:
19
+ - wer
20
+ - cer
21
+ library_name: transformers
22
+ pipeline_tag: automatic-speech-recognition
23
+ widget:
24
+ - src: https://datasets-server.huggingface.co/assets/google/fleurs/--/nb_no/train/1/audio/audio.mp3
25
+ example_title: FLEURS sample 1
26
+ - src: https://datasets-server.huggingface.co/assets/google/fleurs/--/nb_no/train/4/audio/audio.mp3
27
+ example_title: FLEURS sample 2
28
  ---
29
+ # Finetuned Verbatim model.
30
+
31
+ This model was trained for 200 additional steps on top of the model described below. As a result, it outputs only lowercase text without punctuation. It is also considerably more verbatim and makes no attempt to correct grammatical errors in the text.
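
When comparing the lowercase, punctuation-free output of this model against a conventionally formatted reference transcript, it can help to normalize the reference the same way first. The sketch below is an illustration only: `normalize_verbatim` and `word_error_rate` are hypothetical helper names, and the normalization rules (ASCII punctuation stripped, whitespace collapsed) are an assumption, not the exact preprocessing used for training or evaluation.

```python
import string


def normalize_verbatim(text: str) -> str:
    """Lowercase the text, strip ASCII punctuation, and collapse whitespace.

    Assumption: only ASCII punctuation is removed; other marks such as
    Norwegian guillemets would need to be added explicitly.
    """
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = normalize_verbatim("Norge er dere. Norge er oss!")
    print(reference)
    print(word_error_rate(reference, "norge er dere norge er oss"))
```

Normalizing both sides before scoring keeps casing and punctuation differences from being counted as word errors.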
32
+
33
+ # NB-Whisper Medium Verbatim (Release Candidate)
34
+
35
+ **IMPORTANT:** These models are currently Release Candidates. We are in the final stages of testing. If everything proceeds smoothly, we plan to officially release the models later this month.
36
+
37
+ Introducing the **_Norwegian NB-Whisper Medium Verbatim model_**, proudly developed by the National Library of Norway. NB-Whisper is a cutting-edge series of models designed for automatic speech recognition (ASR) and speech translation. These models are based on the work of [OpenAI's Whisper](https://arxiv.org/abs/2212.04356). Each model in the series has been trained for 250,000 steps, utilizing a diverse dataset of 8 million samples. These samples consist of aligned audio clips, each 30 seconds long, culminating in a staggering 66,000 hours of speech. For an in-depth understanding of our training methodology and dataset composition, keep an eye out for our upcoming article.
38
+
39
+ | Model Size | Parameters | Model |
40
+ |------------|------------|------------|
41
+ | Tiny | 39M | [NB-Whisper Tiny](https://huggingface.co/NbAiLabBeta/nb-whisper-tiny) |
42
+ | Base | 74M | [NB-Whisper Base](https://huggingface.co/NbAiLabBeta/nb-whisper-base) |
43
+ | Small | 244M | [NB-Whisper Small](https://huggingface.co/NbAiLabBeta/nb-whisper-small) |
44
+ | Medium | 769M | [NB-Whisper Medium](https://huggingface.co/NbAiLabBeta/nb-whisper-medium) |
45
+ | Large | 1550M | [NB-Whisper Large](https://huggingface.co/NbAiLabBeta/nb-whisper-large) |
46
+
47
+
48
+
49
+ ### Specialised Models
50
+ While the main models are suitable for most transcription tasks, we demonstrate how easy it is to change the output of the main model. The following models are trained for 250 additional steps from the main models above, and might be suitable for more targeted use cases:
51
+ - **Verbatim version**: This lower-cased variant is more literal and suitable for tasks requiring detailed transcription, such as linguistic analysis.
52
+ - **Semantic version**: This variant focuses less on verbatim accuracy but captures the essence of content, ideal for meeting minutes and subtitling.
53
+
54
+
55
+ | Model Size | Parameters | Verbatim version | Semantic version |
56
+ |------------|------------|------------|------------------|
57
+ | Tiny | 39M | [Tiny - verbatim](https://huggingface.co/NbAiLabBeta/nb-whisper-tiny-verbatim) | [Tiny - semantic](https://huggingface.co/NbAiLabBeta/nb-whisper-tiny-semantic) |
58
+ | Base | 74M | [Base - verbatim](https://huggingface.co/NbAiLabBeta/nb-whisper-base-verbatim) | [Base - semantic](https://huggingface.co/NbAiLabBeta/nb-whisper-base-semantic) |
59
+ | Small | 244M | [Small - verbatim](https://huggingface.co/NbAiLabBeta/nb-whisper-small-verbatim) | [Small - semantic](https://huggingface.co/NbAiLabBeta/nb-whisper-small-semantic) |
60
+ | Medium | 769M | [Medium - verbatim](https://huggingface.co/NbAiLabBeta/nb-whisper-medium-verbatim) | [Medium - semantic](https://huggingface.co/NbAiLabBeta/nb-whisper-medium-semantic) |
61
+ | Large | 1550M | [Large - verbatim](https://huggingface.co/NbAiLabBeta/nb-whisper-large-verbatim) | [Large - semantic](https://huggingface.co/NbAiLabBeta/nb-whisper-large-semantic) |
62
+
63
+
64
+ ### Model Description
65
+
66
+ - **Developed by:** [NB AI-Lab](https://ai.nb.no/)
67
+ - **Shared by:** [NB AI-Lab](https://ai.nb.no/)
68
+ - **Model type:** `whisper`
69
+ - **Language(s) (NLP):** Norwegian, Norwegian Bokmål, Norwegian Nynorsk, English
70
+ - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
71
+ - **Trained from model:** [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)
72
+ - **Code Repository:** https://github.com/NbAiLab/nb-whisper/
73
+ - **Paper:** _Coming soon_
74
+ - **Demo:** _See Spaces on this page_
75
+
76
+
77
+ ## How to Use the Models
78
+
79
+ ### Online Demos
80
+ You can try the models directly through the HuggingFace Inference API, accessible on the right side of this page. Be aware that initially, the model needs to load and will run on limited CPU capacity, which might be slow. To enhance your experience, we are temporarily hosting some models on TPUs for a few days, significantly boosting their performance. Explore these under the **Spaces** section on the [Main Page](https://huggingface.co/NbAiLabBeta/).
81
+
82
+ ### Local Setup with HuggingFace
83
+ Alternatively, you can run the models locally. The Tiny, Base, and Small models are optimized for CPU execution. For the Medium and Large models, we recommend a system equipped with a GPU to ensure efficient processing. Setting up and using these models with HuggingFace's Transformers is straightforward, provided you have [Python](https://www.python.org/downloads/) installed on your machine. For practical demonstrations, refer to examples using this [sample mp3 file](https://github.com/NbAiLab/nb-whisper/raw/main/audio/king.mp3).
84
+
85
+ ```bash
86
+ # Download the sample file
87
+ $ wget -N https://github.com/NbAiLab/nb-whisper/raw/main/audio/king.mp3
88
+
89
+ # Install necessary libraries.
90
+ $ pip install "transformers>=4.35.2"
91
+ ```
92
+
93
+ After this is done, you should be able to run this in Python:
94
+
95
+ ```python
96
+ from transformers import pipeline
97
+
98
+ # Load the model
99
+ asr = pipeline("automatic-speech-recognition", "NbAiLabBeta/nb-whisper-medium-verbatim")
100
+
101
+ # Transcribe
102
+ asr("king.mp3", generate_kwargs={'task': 'transcribe', 'language': 'no'})
103
+
104
+ ```
105
+
106
+ <details>
107
+ <summary>Expected output</summary>
108
+
109
+ ```json
110
+ {
111
+ {'text': ' Nordmenn er nordlendinger, trøndere, sørlendinger og folk fra alle andre regioner. Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra.'}
112
+ }
113
+ ```
114
+ </details>
115
+
116
+ #### Extended HuggingFace
117
+ Examining the output above, we see that there are multiple repetitions at the end. This is because the recording is longer than 30 seconds. By passing the `chunk_length_s` argument, we can transcribe longer files. Our experience is that we get slightly better results by setting it to 28 seconds instead of the default 30 seconds. We also recommend setting the beam size to 5 if possible. This greatly increases the accuracy but takes a bit longer and requires slightly more memory. The examples below also illustrate how to transcribe to English or Nynorsk, and how to get timestamps for sentences and words.
118
+
119
+ ```python
120
+ # Long Transcripts
121
+ asr("king.mp3", chunk_length_s=28, generate_kwargs={'task': 'transcribe', 'language': 'no'})
122
+
123
+ # Increase accuracy by setting beam size to 5
124
+ asr("king.mp3", chunk_length_s=28, return_timestamps=True, generate_kwargs={'num_beams': 5, 'task': 'transcribe', 'language': 'no'})
125
+
126
+ # Return Timestamps
127
+ asr("king.mp3", chunk_length_s=28, return_timestamps=True, generate_kwargs={'task': 'transcribe', 'language': 'no'})
128
+
129
+ # Return Word Level Timestamps
130
+ asr("king.mp3", chunk_length_s=28, return_timestamps="word", generate_kwargs={'task': 'transcribe', 'language': 'no'})
131
+
132
+ # Transcribe to Nynorsk
133
+ asr("king.mp3", chunk_length_s=28, generate_kwargs={'task': 'transcribe', 'language': 'nn'})
134
+
135
+ # Transcribe to English
136
+ asr("king.mp3", chunk_length_s=28, generate_kwargs={'task': 'transcribe', 'language': 'en'})
137
+
138
+ ```
139
+ <details>
140
+ <summary>Expected output</summary>
141
+
142
+ Long transcripts:
143
+ ```json
144
+ {
145
+ {'text': ' Nordmenn er nordlendinger, trøndere, sørlendinger og folk fra alle andre regioner. Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra, hvilken nasjonalitet vi tilhører. Det vi kaller hjem, er der hjertet vårt er, og det kan ikke alltid plasseres innenfor landegrenser. Nordmenn er jenter som er glad i jenter, gutter som er glad i gutter, og jenter og gutter som er glad i hverandre. Nordmenn trommer på Gud, Allah, Altet og ingenting. Nordmenn liker Grieg, Kygo, Helbilis og Kari Bremnes. Med andre ord, Norge er dere. Norge er oss. Mitt største håp for Norge er at vi skal klare å ta vare på hverandre, at vi skal bygge dette landet videre på tillit, fellesskap og raushet.'}
146
+ }
147
+ ```
148
+
149
+ Timestamps:
150
+ ```json
151
+ {
152
+ {'text': ' Nordmenn er nordlendinger, trøndere, sørlendinger og folk fra alle andre regioner. Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. hvilken nasjonalitet vi tilhører. Det vi kaller hjem, er der hjertet vårt er, og det kan ikke alltid plasseres innenfor landegrenser. Nordmenn er jenter som er glad i jenter, gutter som er glad i gutter, og jenter og gutter som er glad i hverandre. Nordmenn trommer på Gud, Allah, Altet og ingenting. Nordmenn liker Grieg, Kygo, Helbiles og Kari Bremnes. Med andre ord, Norge er dere. Norge er oss. Mitt største håp for Norge er at vi skal klare å ta vare på hverandre, at vi skal bygge dette landet videre på tillit, fellesskap og raushet.',
153
+ 'chunks': [{'timestamp': (0.0, 5.46),
154
+ 'text': ' Nordmenn er nordlendinger, trøndere, sørlendinger'},
155
+ {'timestamp': (5.52, 8.68), 'text': ' og folk fra alle andre regioner.'},
156
+ {'timestamp': (8.68, 16.64),
157
+ 'text': ' Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria.'},
158
+ {'timestamp': (16.64, 13.3),
159
+ 'text': ' Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi er fra.'},
160
+ {'timestamp': (13.32, 30.28),
161
+ 'text': ' Hvilken nasjonalitet vi er fra. hvilken nasjonalitet vi tilhører.'},
162
+ {'timestamp': (32.52, 39.16),
163
+ 'text': ' Det vi kaller hjem, er der hjertet vårt er, og det kan ikke alltid plasseres'},
164
+ {'timestamp': (39.16, 42.0), 'text': ' innenfor landegrenser.'},
165
+ {'timestamp': (42.0, 46.74),
166
+ 'text': ' Nordmenn er jenter som er glad i jenter, gutter som er glad i gutter,'},
167
+ {'timestamp': (46.74, 51.12),
168
+ 'text': ' og jenter og gutter som er glad i hverandre.'},
169
+ {'timestamp': (51.16, 57.42),
170
+ 'text': ' Nordmenn trommer på Gud, Allah, Altet og ingenting.'},
171
+ {'timestamp': (57.42, 64.3),
172
+ 'text': ' Nordmenn liker Grieg, Kygo, Helbiles og Kari Bremnes.'},
173
+ {'timestamp': (64.34, 71.24),
174
+ 'text': ' Med andre ord, Norge er dere. Norge er oss.'},
175
+ {'timestamp': (71.24, 78.04),
176
+ 'text': ' Mitt største håp for Norge er at vi skal klare å ta vare på hverandre,'},
177
+ {'timestamp': (78.12, 84.68),
178
+ 'text': ' at vi skal bygge dette landet videre på tillit, fellesskap og raushet.'}]}
179
+ }
180
+ ```
181
+
182
+ Word Level Timestamps:
183
+ ```json
184
+ {
185
+ {"text": "Nordmenn er nordlendinger, trøndere, sørlendinger og folk fra alle andre regioner. Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi tilhører. Det vi kaller hjem, er der hjertet vårt er, og det kan ikke alltid plasseres innenfor landegrenser. Nordmenn er jenter som er glad i jenter, gutter som er glad i gutter, og jenter og gutter som er glad i hverandre. Nordmenn trommer på Gud, Allah, Altet og ingenting. Nordmenn liker Grieg, Kygo, Helbilis og Kari Bremnes. Med andre ord, Norge er dere. Norge er oss. Mitt største håp for Norge er at vi skal klare å ta vare på hverandre, at vi skal bygge dette landet videre på tillit, fellesskap og raushet.",
186
+ "chunks": [
187
+ {"text": "Nordmenn", "timestamp": [0.72, 1.42]},
188
+ {"text": "er", "timestamp": [1.42, 1.74]},
189
+ // ... more chunks ...
190
+ {"text": "raushet.", "timestamp": [83.1, 84.88]}
191
+ ]
192
+ }
193
+ }
194
+ ```
195
+
196
+ Nynorsk:
197
+ ```json
198
+ {
199
+ {"text": "Nordmenn er nordlendingar, trøndarar, sørlendingar og folk frå alle andre regionar. Nordmenn er også innvandra frå Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikkje alltid så lett å seie kvar vi er frå, kva nasjonalitet vi tilhøyrer. Det vi kallar heim, er der hjartet vårt er, og det kan ikkje alltid plasserast innanfor landegrenser. Nordmenn er jenter som er glad i jenter, gutar som erade i gutar, og jenter og gutar som er glade i kvarandre. Nordmenn trommar på Gud, Allah, Altet og ingenting. Nordmenn liker Grieg, Kygo, Helbiles og Kari Bremnes. Med andre ord, Noreg er dere! Noreg er oss. Mitt største håp for Noreg er at vi skal klare å ta vare på kvarandre, at vi skal byggje dette landet vidare på tillit, fellesskap og raushet."}
200
+ }
201
+ ```
202
+
203
+ English:
204
+ ```json
205
+ {
206
+ {"text": "Norwegians are Norwegians, trønders, southerners and people from all other regions. Norwegians are also invaded from Afghanistan, Pakistan, Poland, Sweden, Somalia and Suria. It is not always so easy to say where we are from, what nationality we belong to. What we call home is where our heart is, and it cannot always be placed within national borders. Norwegians are girls who like girls, boys who like boys, and girls and boys who like each other. Norwegians thrump on God, Allah, Altet and nothing. Norwegians like Grieg, Kygo, Helbilis and Kari Bremnes. In other words, Norway is you. Norway is us. My biggest hope for Norway is that we should be able to take care of each other, that we should build this country on trust, community and generosity."}
207
+ }
208
+ ```
209
+
210
+ </details>
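
The `chunks` list returned with `return_timestamps=True` can be post-processed into subtitles. The following is a minimal sketch, assuming the output has the shape shown above (a `timestamp` pair in seconds and a `text` string per chunk). The helper names `format_srt_time` and `chunks_to_srt` are hypothetical and not part of Transformers. Chunks whose end time does not come after their start time, which can occur around chunk boundaries, are simply skipped here.

```python
from typing import Iterable, Mapping


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks: Iterable[Mapping]) -> str:
    """Convert pipeline-style chunks into the text of an SRT file."""
    entries = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # Skip chunks with missing or non-increasing timestamps.
        if start is None or end is None or end <= start:
            continue
        index = len(entries) + 1
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


if __name__ == "__main__":
    example = [
        {"timestamp": (0.0, 5.46), "text": " Nordmenn er nordlendinger"},
        {"timestamp": (5.52, 8.68), "text": " og folk fra alle andre regioner."},
    ]
    print(chunks_to_srt(example))
```

In practice, the list passed to `chunks_to_srt` would be the `"chunks"` entry of the dictionary returned by the `asr(...)` call.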
211
+
212
+ ### Whisper CPP
213
+ Whisper CPP is a C++ implementation of the Whisper model, offering the same functionalities with the added benefits of C++ efficiency and performance optimizations. This allows embedding any Whisper model into a binary file, facilitating the development of real applications. However, it requires some familiarity with compiling C++ programs. Their [homepage](https://github.com/ggerganov/whisper.cpp) provides examples of how to build applications, including real-time transcription.
214
+
215
+ We have converted this model to the ggml-format model used by Whisper CPP binaries. The file can be downloaded [here](blob/main/ggml-model.bin), and a `q5_0` quantized version is also available [here](blob/main/ggml-model-q5_0.bin).
216
+
217
+ ```bash
218
+ # We can download and compile whisper.cpp
219
+ $ git clone --depth 1 https://github.com/ggerganov/whisper.cpp --branch v1.5.1
220
+ $ cd whisper.cpp/
221
+ $ make
222
+
223
+ # We also need to convert the audio to WAV as that is the only format supported by whisper.cpp
224
+ $ wget -N https://github.com/NbAiLab/nb-whisper/raw/main/audio/king.mp3
225
+ $ ffmpeg -i king.mp3 -ar 16000 -ac 1 -c:a pcm_s16le king.wav
226
+
227
+ # Let's download the two ggml files from this site
228
+ $ wget https://huggingface.co/NbAiLabBeta/nb-whisper-medium/resolve/main/ggml-model.bin -O models/nb-medium-ggml-model.bin
229
+ $ wget https://huggingface.co/NbAiLabBeta/nb-whisper-medium/resolve/main/ggml-model-q5_0.bin -O models/nb-medium-ggml-model-q5_0.bin
230
+
231
+ # And run it with the f16 default model
232
+ $ ./main -l no -m models/nb-medium-ggml-model.bin king.wav
233
+
234
+ # Or the quantized version
235
+ $ ./main -l no -m models/nb-medium-ggml-model-q5_0.bin king.wav
236
+ ```
237
+
238
+ ### WhisperX and Speaker Diarization
239
+ Speaker diarization is a technique in natural language processing and automatic speech recognition that identifies and separates different speakers in an audio recording. It segments the audio into parts based on who is speaking, enhancing the quality of transcribing meetings or phone calls. We find that [WhisperX](https://github.com/m-bain/whisperX) is the easiest way to use our models for diarizing speech. In addition, WhisperX uses phoneme-based Wav2Vec models to improve the alignment of the timestamps. As of December 2023 it also has native support for the nb-wav2vec models. It currently uses [PyAnnote-audio](https://github.com/pyannote/pyannote-audio) for the actual diarization. This package has a fairly strict licence that requires you to agree to its user terms. Follow the instructions below.
240
+
241
+ ```bash
242
+ # Follow the install instructions on https://github.com/m-bain/whisperX
243
+ # Make sure you have a HuggingFace account and have agreed to the pyannote terms
244
+
245
+ # Log in (or supply HF Token in command line)
246
+ huggingface-cli login
247
+
248
+ # Download a test file
249
+ wget -N https://github.com/NbAiLab/nb-whisper/raw/main/audio/knuthamsun.mp3
250
+
251
+ # Optional. If you get complaints about missing support for Norwegian, do:
252
+ pip uninstall whisperx && pip install git+https://github.com/m-bain/whisperx.git@8540ff5985fceee764acbed94f656063d7f56540
253
+
254
+ # Transcribe the test file. All transcripts will end up in the directory of the mp3-file
255
+ whisperx knuthamsun.mp3 --model NbAiLabBeta/nb-whisper-medium-verbatim --language no --diarize
256
+
257
+ ```
258
+
259
+ You can also run WhisperX from Python. Please take a look at the instructions on the [WhisperX homepage](https://github.com/m-bain/whisperX).
260
+
261
+
262
+
263
+
264
+ ### API
265
+ Instructions for accessing the models via a simple API are included in the demos under Spaces. Note that these demos are temporary and will only be available for a few weeks.
266
+
267
+ ## Training Data
268
+ The training data originates from Språkbanken and the National Library of Norway's digital collection, including:
269
+
270
+ - NST Norwegian ASR Database (16 kHz) and its corresponding dataset
271
+ - Transcribed speeches from the Norwegian Parliament by Språkbanken
272
+ - TV broadcast (NRK) subtitles (NLN digital collection)
273
+ - Audiobooks (NLN digital collection)
274
+
275
+ ## Downstream Use
276
+
277
+ The models, especially the smaller ones, may exhibit occasional hallucinations and may drop parts of the transcript. They are designed to convert spoken language into grammatically correct written sentences, which might not always be word-for-word transcriptions. We have made two extra model variants for users who want a different transcription style. We encourage users to try the models themselves to get a better understanding.
278
+
279
+ ## Bias, Risks, and Limitations
280
+
281
+ Using these models without adequate risk assessment and mitigation could be considered irresponsible. They may contain biases or other undesirable distortions. Users who deploy these models or integrate them into systems or services are responsible for mitigating risks and complying with applicable AI regulations. The National Library of Norway, as the model owner, disclaims liability for any outcomes resulting from third-party use of these models.
282
+
283
+ ### Software
284
+ The model was trained using JAX/Flax and converted to PyTorch, TensorFlow, whisper.cpp, and ONNX formats. These are available under `Files and versions`. We welcome requests for conversion to other formats. All training code and scripts are released under the Apache License 2.0 in the GitHub repository [nb-whisper](https://github.com/NbAiLab/nb-whisper/).
285
+
286
+ ## Citation & Contributors
287
+ The NB-Whisper Medium Verbatim model is a product of the NoSTram project led by Per Egil Kummervold ([@pere](https://huggingface.co/pere)) at the National Library of Norway. Key contributors include Javier de la Rosa ([@versae](https://huggingface.co/versae)), Freddy Wetjen ([@freddyw](https://huggingface.co/freddyw)), and Rolv-Arild Braaten ([@Rolv-Arild](https://huggingface.co/Rolv-Arild)). NB AI-Lab, under the direction of Svein Arne Brygfjeld ([@Brygfjeld](https://huggingface.co/Brygfjeld)), supported the project's successful completion. A detailed paper on our process and findings is forthcoming.
288
+
289
+ ## Disclaimer
290
+
291
+ The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence. In no event shall the owner of the models (The National Library of Norway) be liable for any results arising from the use made by third parties of these models.
292
+
293
+ ## Acknowledgements
294
+
295
+ Our gratitude extends to [Google TPU Research Cloud](https://sites.research.google/trc/about/) for training resources, Google Cloud for translation credits, and HuggingFace's Sanchit Gandhi for technical support. A special thank you to Per Erik Solberg at Språkbanken for the collaboration on the Stortinget corpus.
296
 
297
+ ## Contact
298
+ For feedback, technical concerns, or collaboration inquiries, please contact <a rel="noopener nofollow" href="mailto:[email protected]">[email protected]</a>. If you plan to include this model in your research, contact us for the latest information on our upcoming paper for citation purposes.