reazonspeech-k2-v2-ja-en

reazonspeech-k2-v2-ja-en is an automatic speech recognition (ASR) model trained on the ReazonSpeech v2.0 corpus and LibriSpeech.

This model provides end-to-end Japanese and English speech recognition based on Next-gen Kaldi.

Model Architecture

Character-based RNN-T model.
This model utilizes an enhanced Transformer architecture called Zipformer.

Usage

We recommend running this model through the reazonspeech library.

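from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

# Read the input recording from disk.
audio = audio_from_path("speech.wav")
# Load the bilingual (Japanese/English) model on CPU with fp32 weights.
model = load_model(device="cpu", precision="fp32", language="ja-en")
# Run recognition and print the transcribed text.
ret = transcribe(model, audio)
print(ret.text)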

This model uses byte-level BPE (BBPE), so Japanese tokens are represented by character sequences such as ▁ƊģŊ.
Time stamps are associated with each transcribed token, but because the Japanese tokens are encoded at the byte level, they cannot be read directly.
English tokens, in contrast, are subword units printed in ordinary alphabetic text and can be read directly.
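To illustrate why byte-level tokens are unreadable on their own, the following minimal sketch splits the UTF-8 bytes of a short Japanese string at arbitrary positions. The example string and the split points are hypothetical; they do not reproduce this model's vocabulary or the byte-to-symbol mapping that produces sequences like the one shown above.

```python
def decodable(piece: bytes) -> bool:
    """Return True if the byte string is valid UTF-8 on its own."""
    try:
        piece.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "音声"                  # two Japanese characters
raw = text.encode("utf-8")     # each character occupies three bytes

# Hypothetical token boundaries that cut through the characters.
pieces = [raw[:2], raw[2:4], raw[4:]]

for piece in pieces:
    status = "readable" if decodable(piece) else "not readable on its own"
    print(piece.hex(), status)

# Only the concatenation of all pieces recovers the original text.
restored = b"".join(pieces).decode("utf-8")
print(restored)
```

English text is mostly single-byte ASCII, so a byte-level split of an English word still yields readable fragments, which is consistent with the behaviour described above.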

Performance

This model was validated after training, with the following results.

Word Error Rates (WERs) are listed below:

Zipformer WER (%)	ReazonSpeech dev	ReazonSpeech test	LibriSpeech test-clean	LibriSpeech test-other
greedy_search	5.9	4.07	3.46	8.35
modified_beam_search	4.87	3.61	3.28	8.07

Character Error Rates (CERs) for Japanese are listed below:

Decoding Method	In-Distribution CER	JSUT	CommonVoice	TEDx
greedy search	12.56	6.93	9.75	9.67
modified beam search	11.59	6.97	9.55	9.51
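
For readers unfamiliar with the metrics, WER and CER are conventionally defined as the edit distance between reference and hypothesis, divided by the reference length, counted in words or characters respectively. The sketch below assumes that standard definition; the scoring tool and text normalization used to produce the numbers above are not stated here, so the function names and the whitespace handling are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: units are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate: units are characters, spaces ignored (an assumption)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


print(wer("the cat sat", "the cat sit"))
print(cer("こんにちは", "こんばんは"))
```

Because insertions count as errors, either rate can exceed 100% when the hypothesis is much longer than the reference.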

Additional tests were performed with manually procured audio files (see test_wavs/transcripts.txt).
The model performs reasonably well as long as the input audio contains a single language.
However, when multiple languages are included in the same input, the model struggles to provide an accurate transcription (see test_multi).
This result can be avoided by segmenting the audio into chunks at pauses in speech.

test_ja_1: 57% (CER)
test_ja_2: 26% (CER)
test_multi: 99% (CER)
test_en_1: 12% (WER)
test_en_2: 27% (WER)
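
The pause-based segmentation suggested above could be sketched as a simple energy threshold over fixed-size frames. This is a minimal illustration and not part of the reazonspeech library: the function name, the 20 ms frame size, the RMS threshold, and the 300 ms minimum pause are all assumptions that would need tuning for real recordings.

```python
import math


def split_on_pauses(samples, sample_rate, frame_ms=20,
                    threshold=0.01, min_pause_ms=300):
    """Return (start, end) sample index pairs for chunks separated by pauses."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_pause_frames = max(1, min_pause_ms // frame_ms)

    # Mark a frame as voiced when its RMS energy exceeds the threshold.
    voiced = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        voiced.append(rms > threshold)

    chunks = []
    start = None      # first frame of the chunk being built
    silent_run = 0    # consecutive silent frames seen inside that chunk
    for idx, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = idx
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                end = idx - silent_run + 1  # frame after the last voiced one
                chunks.append((start * frame_len, end * frame_len))
                start = None
                silent_run = 0

    # Close a chunk that is still open when the audio ends.
    if start is not None:
        end = len(voiced) - silent_run
        chunks.append((start * frame_len, min(end * frame_len, len(samples))))
    return chunks


# Synthetic check: 0.5 s tone, 0.5 s silence, 0.5 s tone at 16 kHz.
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]
silence = [0.0] * (rate // 2)
print(split_on_pauses(tone + silence + tone, rate))
```

Each returned range can then be cut out of the recording and transcribed as a separate input. A pause does not guarantee a language switch, so this only reduces the chance that one chunk mixes Japanese and English.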

License

Apache License 2.0
