Update README.md
README.md (CHANGED)
@@ -63,16 +63,15 @@ model-index:
_Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR. Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, together with a student model that consists of the full encoder of the
teacher Whisper model and a decoder whose two layers are initialized from the first and last layers of the teacher's decoder.
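The official initialization lives in the [kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper) training code; purely as an illustration of the idea, a minimal sketch with 🤗 Transformers could look as follows (the embedding handling and layer indices are our assumptions from the description above, not the official script):

```python
# Minimal sketch of the distil-whisper style student initialization described
# above; NOT the official script. The student keeps the full encoder and gets
# a 2-layer decoder seeded from the teacher's first and last decoder layers.
import copy
from transformers import WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = 2  # only the decoder is shrunk
student = WhisperForConditionalGeneration(student_config)

# copy the full teacher encoder
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

# copy the decoder embeddings and final layer norm, then seed the two student
# decoder layers from the teacher's first and last decoder layers
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())
for student_idx, teacher_idx in {0: 0, 1: teacher.config.decoder_layers - 1}.items():
    student.model.decoder.layers[student_idx].load_state_dict(
        teacher.model.decoder.layers[teacher_idx].state_dict()
    )
```

Sharing the full encoder means the student inherits the teacher's acoustic representation unchanged, and only the much smaller two-layer decoder has to be distilled.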
As the initial version, we release ***kotoba-whisper-v1.0*** trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec of audio with 18 text tokens on average) after
samples whose transcriptions have a WER higher than 10 are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for details).
The model was trained for 8 epochs with batch size 256 at a sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).

Kotoba-whisper-v1.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set
from ReazonSpeech, and competitive CER and WER on out-of-domain test sets including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice) (see [Evaluation](#evaluation) for details).

- ***CER***
@@ -302,12 +301,8 @@ See [https://huggingface.co/distil-whisper/distil-large-v3#model-details](https:
## Evaluation

The following code snippet demonstrates how to evaluate the kotoba-whisper model on the Japanese subset of CommonVoice 8.0.
First, we need to install the required packages, including 🤗 Datasets to load the audio data and 🤗 Evaluate to perform the CER calculation:

```bash
# the packages named above; the repository may pin exact versions
pip install --upgrade transformers datasets[audio] evaluate jiwer
```
@@ -326,6 +321,7 @@ from tqdm import tqdm
# config
model_id = "kotoba-tech/kotoba-whisper-v1.0"
dataset_name = "japanese-asr/ja_asr.common_voice_8_0"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
audio_column = 'audio'
@@ -338,7 +334,7 @@ model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# load the dataset and resample the audio to 16kHz
dataset = load_dataset(dataset_name, split="test")
dataset = dataset.cast_column(audio_column, features.Audio(sampling_rate=processor.feature_extractor.sampling_rate))
dataset = dataset.select([0, 1, 2, 3, 4, 5, 6])  # quick check on the first seven examples; remove to evaluate the full test set
@@ -375,6 +371,13 @@ cer = 100 * cer_metric.compute(predictions=all_transcriptions, references=all_re
print(cer)
```
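Because the diff shows only the changed lines, the evaluation script appears above in fragments. As a convenience, here is a self-contained sketch of the full loop assembled from those fragments; it is our reconstruction rather than the exact script in the README, and the reference-text column name (`transcription`) and the `generate` keyword arguments are assumptions:

```python
import torch
from tqdm import tqdm
from datasets import load_dataset, features
from evaluate import load
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# config (values taken from the fragments above)
model_id = "kotoba-tech/kotoba-whisper-v1.0"
dataset_name = "japanese-asr/ja_asr.common_voice_8_0"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
audio_column = 'audio'
text_column = 'transcription'  # assumed name of the reference-text column

# load the model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# load the dataset and resample the audio to 16kHz
dataset = load_dataset(dataset_name, split="test")
dataset = dataset.cast_column(audio_column, features.Audio(sampling_rate=processor.feature_extractor.sampling_rate))
dataset = dataset.select([0, 1, 2, 3, 4, 5, 6])

# transcribe each sample and collect the reference texts
all_transcriptions, all_references = [], []
for sample in tqdm(dataset):
    inputs = processor(sample[audio_column]["array"],
                       sampling_rate=processor.feature_extractor.sampling_rate,
                       return_tensors="pt")
    input_features = inputs.input_features.to(device).to(torch_dtype)
    generated = model.generate(input_features, language="ja", task="transcribe")
    all_transcriptions.append(processor.batch_decode(generated, skip_special_tokens=True)[0])
    all_references.append(sample[text_column])

# character error rate in percent, as printed by the snippet above
cer_metric = load("cer")
cer = 100 * cer_metric.compute(predictions=all_transcriptions, references=all_references)
print(cer)
```

Removing the `dataset.select(...)` line runs the evaluation on the full test split.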
The Hugging Face links to the major Japanese ASR datasets for evaluation are summarized [here](https://huggingface.co/collections/japanese-asr/japanese-asr-evaluation-dataset-66051a03d6ca494d40baaa26).
For example, to evaluate the model on JSUT Basic5000, change the `dataset_name`:

```diff
- dataset_name = "japanese-asr/ja_asr.common_voice_8_0"
+ dataset_name = "japanese-asr/ja_asr.jsut_basic5000"
```

## Acknowledgements
* OpenAI for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).