KB-Whisper Large
The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across FLEURS, CommonVoice and NST, our best performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's whisper-large-v3. The performance of smaller Whisper model sizes on Swedish speech has also substantially improved, with kb-whisper-small outperforming openai/whisper-large-v3 (a model six times its size).
Model size | Model | FLEURS | CommonVoice | NST |
---|---|---|---|---|
tiny | KBLab | 13.2 | 12.9 | 11.2 |
tiny | OpenAI | 59.2 | 67.8 | 85.2 |
base | KBLab | 9.1 | 8.7 | 7.8 |
base | OpenAI | 39.6 | 52.1 | 53.4 |
small | KBLab | 7.3 | 6.4 | 6.6 |
small | OpenAI | 20.6 | 26.4 | 26.4 |
medium | KBLab | 6.6 | 5.4 | 5.8 |
medium | OpenAI | 12.1 | 15.8 | 17.1 |
large-v3 | KBLab | 5.4 | 4.1 | 5.2 |
large-v3 | OpenAI | 7.8 | 9.5 | 11.3 |
Table: Word Error Rate (WER) comparison between KBLab's Whisper models and the corresponding OpenAI versions.
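The 47% figure can be reproduced from the large-v3 rows of the table. The following is a minimal sketch, assuming the average is the unweighted mean of the per-dataset relative WER reductions (the averaging method is not stated above):

```python
# WER values copied from the large-v3 rows of the table above.
kblab_wer = {"FLEURS": 5.4, "CommonVoice": 4.1, "NST": 5.2}
openai_wer = {"FLEURS": 7.8, "CommonVoice": 9.5, "NST": 11.3}


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction in percent, using `baseline` as the reference."""
    return 100.0 * (baseline - improved) / baseline


# Per-dataset reduction of KBLab large relative to OpenAI large-v3.
reductions = {
    name: relative_reduction(openai_wer[name], kblab_wer[name])
    for name in kblab_wer
}
# Assumption: unweighted mean over the three datasets.
average_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower WER")
print(f"Average: {average_reduction:.1f}%")
```

The same function can be applied to any other pair of rows to compare model sizes.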
Usage
We provide checkpoints in different formats: Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2 (used in faster-whisper and WhisperX).
Hugging Face
Inference example for using KB-Whisper with Hugging Face:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise the CPU with full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True for output with timestamps
res = pipe("audio.mp3", chunk_length_s=30, generate_kwargs=generate_kwargs)
print(res["text"])
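With return_timestamps=True, the pipeline result contains a "chunks" list alongside "text", where each chunk carries a (start, end) tuple in seconds. The sketch below shows one way to print those chunks. format_chunks is a hypothetical helper, and the sample dictionary is made-up data standing in for real pipeline output:

```python
def format_chunks(result: dict) -> list[str]:
    """Render pipeline chunks as '[start -> end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end timestamp can be None, e.g. when the last chunk is cut off.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s -> {end_label}] {chunk['text'].strip()}")
    return lines


# Made-up example in the shape returned by pipe(..., return_timestamps=True)
example = {
    "text": " Hej och välkommen. Idag ska vi prata om tal.",
    "chunks": [
        {"timestamp": (0.0, 2.4), "text": " Hej och välkommen."},
        {"timestamp": (2.4, None), "text": " Idag ska vi prata om tal."},
    ],
}

for line in format_chunks(example):
    print(line)
```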
Faster-whisper
Faster-whisper provides fast and efficient inference via a reimplementation of Whisper using ctranslate2.
#### faster-whisper model ####
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-large"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache",  # cache directory
)

# Transcribe audio.wav (convert to 16 kHz mono WAV first via ffmpeg).
# condition_on_previous_text=False can reduce hallucinations if we don't use prompts.
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
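The comment in the example above asks for 16 kHz mono WAV input. One way to do that conversion from Python is to call ffmpeg through subprocess. This is a sketch that assumes ffmpeg is installed and available on PATH; the helper names and file names are illustrative:

```python
import subprocess


def ffmpeg_16k_mono_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts `src` to 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                  # overwrite dst if it already exists
        "-i", src,             # input file (any format ffmpeg can decode)
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to a single channel
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        dst,
    ]


def convert_to_16k_mono(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_16k_mono_cmd(src, dst), check=True)


# Show the command without running it.
print(" ".join(ffmpeg_16k_mono_cmd("audio.mp3", "audio.wav")))
```

Calling convert_to_16k_mono("audio.mp3", "audio.wav") before model.transcribe("audio.wav", ...) produces a file in the expected format.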
WhisperX
WhisperX provides a convenient method of getting accurate word-level timestamps. The library combines (force aligns) the text output of Whisper with the accurate timestamps of Wav2vec2. We provide an example below of how to use KB-Whisper together with KBLab/wav2vec2-large-voxrex-swedish.
import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with KB-Whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-large", device, compute_type=compute_type, download_root="cache"  # cache_dir
)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# Free the model if low on GPU resources
# import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache_dir
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)
print(result["segments"])  # word-level timestamps after alignment
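After alignment, each segment is expected to carry a "words" list with per-word start and end times. The sketch below flattens that structure into one list. It assumes the word dictionaries use the keys "word", "start" and "end", and that words the aligner could not place (for example digits) may lack timestamps. flatten_words is a hypothetical helper and the sample data is made up:

```python
def flatten_words(aligned: dict) -> list[tuple[str, float, float]]:
    """Collect (word, start, end) tuples from an aligned WhisperX-style result."""
    words = []
    for segment in aligned.get("segments", []):
        for item in segment.get("words", []):
            # Skip words without timestamps instead of guessing their position.
            if "start" in item and "end" in item:
                words.append((item["word"], item["start"], item["end"]))
    return words


# Made-up example in the assumed shape of the aligned result
example = {
    "segments": [
        {
            "start": 0.1,
            "end": 1.6,
            "text": "Hej världen 2024",
            "words": [
                {"word": "Hej", "start": 0.1, "end": 0.4, "score": 0.93},
                {"word": "världen", "start": 0.5, "end": 1.0, "score": 0.88},
                {"word": "2024"},  # no timestamps assigned
            ],
        }
    ]
}

for word, start, end in flatten_words(example):
    print(f"{start:.2f}-{end:.2f}  {word}")
```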
Whisper.cpp / GGML
We provide GGML checkpoints used in the apps whisper.cpp and MacWhisper. To use our model with whisper.cpp, first clone the repository and build the library:
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release
To use the model you need to download one of the GGML checkpoints we have uploaded. You can either use the download buttons in the model repository's file listing on Hugging Face, or download using wget:
wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model-q5_0.bin # Quantized version
# wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model.bin # Non-quantized version
Run inference by specifying the model path after the argument -m, along with the path to the audio file as the last positional argument.
./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav
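whisper-cli can also be driven from Python. The sketch below builds the same command as above and adds a language option. It assumes that the -l flag of whisper-cli selects the spoken language and that English is used when the flag is omitted; check ./build/bin/whisper-cli --help on your build to confirm. The helper names and paths are illustrative:

```python
import subprocess


def whisper_cli_cmd(
    model_path: str,
    audio_path: str,
    language: str = "sv",
    binary: str = "./build/bin/whisper-cli",
) -> list[str]:
    """Build a whisper-cli command; the audio file is the last positional argument."""
    cmd = [binary, "-m", model_path]
    if language:
        # Assumption: -l sets the spoken language (Swedish = "sv").
        cmd += ["-l", language]
    cmd.append(audio_path)
    return cmd


def transcribe(model_path: str, audio_path: str, language: str = "sv") -> str:
    """Run whisper-cli and return what it writes to standard output."""
    completed = subprocess.run(
        whisper_cli_cmd(model_path, audio_path, language),
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


# Show the command without running it.
print(" ".join(whisper_cli_cmd("ggml-model-q5_0.bin", "../audio.wav")))
```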
onnx (optimum) and transformers.js usage
You can use the onnx checkpoints via Hugging Face's optimum library in the following manner:
import soundfile as sf
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "KBLab/kb-whisper-large"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

# sf.read returns (samples, sample_rate); the audio must be 16 kHz mono.
audio, sample_rate = sf.read("audio.wav")
# Passing the file's real sample rate lets the feature extractor reject non-16 kHz input.
inputs = processor.feature_extractor(audio, sampling_rate=sample_rate, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))
An example of an app that runs inference locally in the browser with transformers.js and KB-Whisper can be found at https://whisper.mesu.re/ (created by Pierre Mesure). A template for setting up such an app with JavaScript can be found at https://github.com/xenova/whisper-web.
Training data
Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in two stages, each applying different quality filters and different thresholds for those filters.
Stage 1 employed low threshold values (0 to 0.30 BLEU depending on dataset), whereas Stage 2 used stricter thresholds (BLEU >= 0.7, weighted ROUGE-N >= 0.7, CER of first and last 10 characters <= 0.2).
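The Stage 2 thresholds can be read as a conjunction of checks on each pair of texts being compared. The sketch below is an illustration only, not the authors' filtering code: the exact metric definitions are not given here, so BLEU and ROUGE-N are taken as precomputed inputs, the names reference and hypothesis for the compared texts are assumptions, and edge_cer is a hypothetical helper that computes a plain character error rate (edit distance divided by reference length) on the first or last 10 characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def edge_cer(reference: str, hypothesis: str, n: int = 10, tail: bool = False) -> float:
    """CER restricted to the first (or, with tail=True, the last) n characters."""
    ref = reference[-n:] if tail else reference[:n]
    hyp = hypothesis[-n:] if tail else hypothesis[:n]
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


def passes_stage2(bleu: float, rouge_n: float, reference: str, hypothesis: str) -> bool:
    """Apply the Stage 2 thresholds: BLEU, weighted ROUGE-N, and CER at both edges."""
    return (
        bleu >= 0.7
        and rouge_n >= 0.7
        and edge_cer(reference, hypothesis) <= 0.2
        and edge_cer(reference, hypothesis, tail=True) <= 0.2
    )


text = "hej och välkommen till riksdagen"
print(passes_stage2(0.8, 0.75, text, text))
print(passes_stage2(0.8, 0.75, text, "hej och välkommen till"))
```

Checking the edges separately catches pairs where the audio segment starts or ends in the middle of the transcript even though the overall overlap scores are high.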
Dataset | Continued pretraining (h) -- Stage 1 | Finetuning (h) -- Stage 2 |
---|---|---|
Subtitles | 34,261 | 3,110 |
Riksdag | 21,949 | 5,119 |
ISOF | 54 | 54 |
NST | 250 | 250 |
Total | 56,514 | 8,533 |
The default when loading our models through Hugging Face is Stage 2. We have also uploaded the continued pretraining checkpoints and tagged them. You can load these other checkpoints by specifying the revision in .from_pretrained(). The pretrained checkpoints are available under the tag pretrained-checkpoint. The default Stage 2 model is tagged standard. We also supply a different Stage 2 checkpoint, with a more condensed style of transcribing, under the tag subtitle.
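Selecting a checkpoint by tag could look like the sketch below. The variant names on the left of the mapping and the load_kb_whisper helper are hypothetical; the tag names on the right are the ones mentioned above, with "pretrained-checkpoint" taken from the link text, so verify them against the repository's tag list before relying on them.

```python
# Hypothetical variant names mapped to the repository tags named in the text.
REVISIONS = {
    "stage1": "pretrained-checkpoint",
    "stage2": "standard",
    "stage2-subtitle": "subtitle",
}


def load_kb_whisper(variant: str = "stage2", model_id: str = "KBLab/kb-whisper-large"):
    """Load a KB-Whisper checkpoint by variant name (downloads on first use)."""
    if variant not in REVISIONS:
        raise ValueError(f"Unknown variant {variant!r}; choose from {sorted(REVISIONS)}")
    # Imported lazily so the mapping can be inspected without transformers installed.
    from transformers import AutoModelForSpeechSeq2Seq

    return AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, revision=REVISIONS[variant], cache_dir="cache"
    )


print(sorted(REVISIONS))
```

For example, load_kb_whisper("stage2-subtitle") would request the checkpoint with the more condensed transcription style.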
Evaluation
WER
Model size | Model | FLEURS | CommonVoice | NST |
---|---|---|---|---|
tiny | KBLab | 13.2 | 12.9 | 11.2 |
tiny | OpenAI | 59.2 | 67.8 | 85.2 |
base | KBLab | 9.1 | 8.7 | 7.8 |
base | OpenAI | 39.6 | 52.1 | 53.4 |
small | KBLab | 7.3 | 6.4 | 6.6 |
small | OpenAI | 20.6 | 26.4 | 26.4 |
medium | KBLab | 6.6 | 5.4 | 5.8 |
medium | OpenAI | 12.1 | 15.8 | 17.1 |
large-v3 | KBLab | 5.4 | 4.1 | 5.2 |
large-v3 | OpenAI | 7.8 | 9.5 | 11.3 |
BLEU Score
Model size | Model | FLEURS | CommonVoice | NST |
---|---|---|---|---|
tiny | KBLab | 76.6 | 73.7 | 74.3 |
tiny | OpenAI | 26.9 | 21.1 | 24.0 |
base | KBLab | 83.2 | 79.9 | 78.3 |
base | OpenAI | 41.1 | 32.5 | 36.9 |
small | KBLab | 86.6 | 83.5 | 79.6 |
small | OpenAI | 64.0 | 56.5 | 58.2 |
medium | KBLab | 87.6 | 85.0 | 80.2 |
medium | OpenAI | 77.1 | 70.1 | 68.9 |
large-v3 | KBLab | 89.8 | 87.2 | 81.1 |
large-v3 | OpenAI | 84.9 | 79.1 | 75.1 |
Citation
Paper reference coming soon.