reazonspeech-k2-v2-ja-en

reazonspeech-k2-v2-ja-en is an automatic speech recognition (ASR) model trained on the ReazonSpeech v2.0 corpus and LibriSpeech.

This model provides end-to-end Japanese and English speech recognition based on Next-gen Kaldi.

Model Architecture

  • Character-based RNN-T model.

  • Uses Zipformer, an enhanced Transformer architecture.

Usage

We recommend using this model through the reazonspeech library.

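from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

# Load the audio file to transcribe
audio = audio_from_path("speech.wav")
# Load the bilingual (Japanese/English) model on the CPU in 32-bit precision
model = load_model(device="cpu", precision="fp32", language="ja-en")
# Run recognition and print the transcript
ret = transcribe(model, audio)
print(ret.text)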

This model uses byte-level BPE (BBPE), so Japanese tokens are represented by character sequences such as ▁ƊģŊ.
A timestamp is associated with each transcribed token, but the Japanese tokens are encoded at the byte level and are not directly readable.
The English tokens, by contrast, are subword units printed in ordinary alphabetic text and can be read directly.
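
To illustrate why byte-level tokens are not directly readable, the following sketch uses plain Python string handling. It is not the model's tokenizer and does not reproduce its byte-to-symbol mapping; it only shows that each Japanese character spans several UTF-8 bytes, so a byte-level token boundary can fall in the middle of a character.

```python
# Illustration only: this is NOT the tokenizer used by the model.
text = "音声認識"  # "speech recognition"

raw = text.encode("utf-8")
print(len(text), "characters ->", len(raw), "bytes")

# A byte-level token may cover only part of a character.
# Cutting after 4 bytes splits the second character.
piece = raw[:4]
print(piece.decode("utf-8", errors="replace"))

# Concatenating all pieces restores the original text.
pieces = [raw[:4], raw[4:]]
restored = b"".join(pieces).decode("utf-8")
print(restored == text)
```

For the same reason, a single token's timestamp may refer to a fragment of a character rather than to a whole character.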

Performance

After training, the model was evaluated with the following results.

Word Error Rates (WERs) are listed below:

Zipformer WER (%)      ReazonSpeech dev   ReazonSpeech test   LibriSpeech test-clean   LibriSpeech test-other
greedy_search          5.9                4.07                3.46                     8.35
modified_beam_search   4.87               3.61                3.28                     8.07

Character Error Rates (CERs, %) for Japanese are listed below:

Decoding Method        In-Distribution CER   JSUT   CommonVoice   TEDx
greedy search          12.56                 6.93   9.75          9.67
modified beam search   11.59                 6.97   9.55          9.51
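
For reference, WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length (counted in words for WER and in characters for CER). The sketch below is a minimal pure-Python version. It is not the scoring script used to produce the numbers above, the function names are illustrative, and it assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance over whitespace-separated words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: edit distance over characters, spaces ignored."""
    ref_chars = list(ref.replace(" ", ""))
    return edit_distance(ref_chars, list(hyp.replace(" ", ""))) / len(ref_chars)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deleted word out of six
print(cer("音声認識", "音声認証"))  # one substituted character out of four
```

CER is the more natural measure for Japanese because the text is not delimited by spaces, so word boundaries depend on the segmenter used.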

Additional tests were performed with manually procured audio files (see test_wavs/transcripts.txt).
The model performs reasonably well as long as the input audio contains a single language.
However, when multiple languages occur in the same input, the model struggles to produce an accurate transcription (see test_multi).
This can be mitigated by segmenting the audio into chunks at pauses in speech before transcription.

  • test_ja_1: 57% (CER)
  • test_ja_2: 26% (CER)
  • test_multi: 99% (CER)
  • test_en_1: 12% (WER)
  • test_en_2: 27% (WER)
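
Segmenting audio at pauses can be approximated with a simple energy threshold. The following is a minimal sketch, assuming mono samples given as floats in [-1, 1]. The function name, frame size, threshold, and minimum pause length are illustrative choices rather than values from the model authors, and a dedicated voice activity detector would likely be more robust on real recordings.

```python
import math


def split_on_pauses(samples, sample_rate, frame_ms=20, threshold=0.01, min_pause_ms=300):
    """Return (start, end) sample indices of speech chunks separated by pauses."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_pause_frames = max(1, min_pause_ms // frame_ms)

    chunks = []
    start = None            # sample index where the current chunk begins
    last_voiced_end = None  # end of the most recent non-silent frame
    silent_run = 0          # consecutive silent frames inside the current chunk

    for pos in range(0, len(samples), frame_len):
        frame = samples[pos:pos + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= threshold:
            if start is None:
                start = pos
            last_voiced_end = pos + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The pause is long enough: close the chunk at the last voiced frame.
                chunks.append((start, last_voiced_end))
                start = None
                silent_run = 0

    if start is not None:
        chunks.append((start, last_voiced_end))
    return chunks


# Synthetic check: 0.5 s tone, 0.5 s silence, 0.5 s tone at 16 kHz.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr // 2)]
silence = [0.0] * (sr // 2)
signal = tone + silence + tone
print(split_on_pauses(signal, sr))  # [(0, 8000), (16000, 24000)]
```

Each (start, end) pair could then be cut from the waveform and transcribed separately, so that a single call to the model is less likely to see both languages at once.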

License

Apache License 2.0
