reazonspeech-k2-v2-ja-en
reazonspeech-k2-v2-ja-en is an automatic speech recognition (ASR) model trained on the ReazonSpeech v2.0 corpus and LibriSpeech.
This model provides end-to-end Japanese and English speech recognition based on Next-gen Kaldi.
Model Architecture
Character-based RNN-T model.
This model utilizes an enhanced Transformer architecture called Zipformer.
Usage
We recommend using this model through the reazonspeech library.
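```python
from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

# Load the audio file to transcribe
audio = audio_from_path("speech.wav")
# Load the bilingual (Japanese/English) model on CPU with fp32 precision
model = load_model(device="cpu", precision="fp32", language="ja-en")
# Run recognition and print the transcribed text
ret = transcribe(model, audio)
print(ret.text)
```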
This model uses byte-level BPE (BBPE), so Japanese tokens are represented by character sequences such as ▁ƊģŊ.
Time stamps are associated with each transcribed token, but the Japanese tokens are encoded at the byte level and are not human-readable on their own.
The English tokens, by contrast, are subword units printed in ordinary alphabetic text and can be read directly.
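To illustrate why byte-level tokens are not readable on their own, here is a minimal sketch. It assumes a hypothetical byte-to-character mapping (each UTF-8 byte value shifted by 0x100); the table actually used by this model's tokenizer is different, and `to_byte_symbols` and `from_byte_symbols` are illustrative names that are not part of the reazonspeech library.

```python
# Hypothetical mapping: each UTF-8 byte b is displayed as chr(0x100 + b).
# The real BBPE table used by the model differs; this only shows the idea.
OFFSET = 0x100

def to_byte_symbols(text: str) -> str:
    """Encode text as UTF-8 and map every byte to one printable symbol."""
    return "".join(chr(OFFSET + b) for b in text.encode("utf-8"))

def from_byte_symbols(symbols: str) -> str:
    """Invert the mapping; incomplete byte sequences become U+FFFD."""
    raw = bytes(ord(ch) - OFFSET for ch in symbols)
    return raw.decode("utf-8", errors="replace")

symbols = to_byte_symbols("音声")
print(len(symbols))                    # 6: each kanji occupies three UTF-8 bytes
print(from_byte_symbols(symbols))      # the full sequence decodes back to the text
print(from_byte_symbols(symbols[:2]))  # a piece cut mid-character is not valid UTF-8
```

Because a single token may cover only part of a character's bytes, a token generally has to be combined with its neighbours before it can be decoded into readable Japanese text.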
Performance
The model was validated after training, with the following results.
Word Error Rates (WERs) are listed below:
Datasets | ReazonSpeech | ReazonSpeech | LibriSpeech | LibriSpeech |
---|---|---|---|---|
Zipformer WER (%) | dev | test | test-clean | test-other |
greedy_search | 5.9 | 4.07 | 3.46 | 8.35 |
modified_beam_search | 4.87 | 3.61 | 3.28 | 8.07 |
Character Error Rates (CERs) for Japanese are listed below:
Decoding Method | In-Distribution CER | JSUT | CommonVoice | TEDx |
---|---|---|---|---|
greedy search | 12.56 | 6.93 | 9.75 | 9.67 |
modified beam search | 11.59 | 6.97 | 9.55 | 9.51 |
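For readers unfamiliar with these metrics, the following is a minimal sketch of how WER and CER are commonly computed from an edit distance. It is not the evaluation script used to produce the numbers above; the function names are illustrative, no text normalization is applied, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent; words are split on whitespace."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent; spaces are ignored."""
    ref_chars = list(ref.replace(" ", ""))
    hyp_chars = list(hyp.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)

# One deleted word out of six reference words
print(wer("the cat sat on the mat", "the cat sat on mat"))
# One substitution and one deletion out of seven reference characters
print(cer("今日は晴れです", "今日は雨です"))
```

CER is the usual choice for Japanese because the text has no whitespace word boundaries, whereas WER is the conventional metric for English.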
Additional tests were performed with manually procured audio files (see test_wavs/transcripts.txt).
The model performs reasonably well as long as the input audio contains a single language.
However, when multiple languages are included in the same input, the model struggles to provide an accurate transcription (see test_multi).
This result can be avoided by segmenting the audio into chunks at pauses in speech, so that each chunk contains a single language.
The error rates on these files are:
- test_ja_1: 57% (CER)
- test_ja_2: 26% (CER)
- test_multi: 99% (CER)
- test_en_1: 12% (WER)
- test_en_2: 27% (WER)
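The segmentation mentioned above could be approached as in the following sketch, which splits audio at pauses using a simple frame-energy threshold. This is an assumption-laden illustration, not the procedure used for the tests: `split_on_pauses` is a hypothetical helper, the threshold and durations are arbitrary defaults, and samples are assumed to be floats in the range [-1, 1].

```python
import math

def split_on_pauses(samples, sample_rate, frame_ms=20, threshold=0.01, min_pause_ms=300):
    """Return (start, end) sample index pairs for chunks separated by pauses.

    A frame counts as silent when its RMS energy is below `threshold`;
    a pause is a run of silent frames lasting at least `min_pause_ms`.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_pause_frames = max(1, min_pause_ms // frame_ms)
    chunks, start, silent_run = [], None, 0
    for pos in range(0, len(samples), frame_len):
        frame = samples[pos:pos + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            if start is None:
                start = pos  # speech begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # Close the chunk at the first silent frame of the pause
                chunks.append((start, pos - (silent_run - 1) * frame_len))
                start, silent_run = None, 0
    if start is not None:
        # Audio ended before a full pause was seen
        chunks.append((start, len(samples)))
    return chunks

# Synthetic check: 0.5 s tone, 0.5 s silence, 0.5 s tone at 16 kHz
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]
samples = tone + [0.0] * (rate // 2) + tone
print(split_on_pauses(samples, rate))
```

Each resulting chunk, `samples[start:end]`, could then be saved and transcribed separately. A dedicated voice activity detector is likely to be more robust than a fixed energy threshold on noisy recordings.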