metadata

library_name: transformers
datasets:
  - reazon-research/reazonspeech
language:
  - ja
metrics:
  - cer
base_model:
  - rinna/japanese-hubert-base
pipeline_tag: automatic-speech-recognition

VAD-less Japanese ASR Model

This is a Japanese speech recognition model fine-tuned from rinna/japanese-hubert-base for real-time Automatic Speech Recognition (ASR) in noisy environments.
Its main feature is a "VAD-less" design: no separate Voice Activity Detection (VAD) step is required.
Instead, the model explicitly recognizes non-speech segments (noise or silence) in the audio input and outputs them as the special tokens [雑音] (noise) and [無音] (silence).
The goal is a speech recognition system with high real-time performance, achieved by omitting the VAD stage that would otherwise precede recognition.
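
Because non-speech segments appear in the transcript as ordinary tokens, a downstream application will usually want to filter them out. The following is a minimal sketch of such post-processing, assuming the transcript arrives as a plain string containing the literal tokens [雑音] and [無音]; the helper names `strip_non_speech` and `is_speech` are hypothetical and not part of this model's API.

```python
import re

# Special non-speech tokens named in this model card.
NON_SPEECH_TOKENS = ("[雑音]", "[無音]")

# Pre-compiled pattern matching either token literally.
_NON_SPEECH_RE = re.compile("|".join(re.escape(tok) for tok in NON_SPEECH_TOKENS))


def strip_non_speech(transcript: str) -> str:
    """Return the transcript with all non-speech tokens removed."""
    return _NON_SPEECH_RE.sub("", transcript).strip()


def is_speech(transcript: str) -> bool:
    """Return True if anything other than non-speech tokens was recognized."""
    return bool(strip_non_speech(transcript))


if __name__ == "__main__":
    print(strip_non_speech("[無音]こんにちは[雑音]"))
    print(is_speech("[雑音][無音]"))
```

In a streaming setup, a check like `is_speech` can play the role a VAD gate would otherwise have: chunks that decode to only non-speech tokens are simply discarded.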

The model was fine-tuned on the medium set (approx. 1,000 hours) of the ReazonSpeech v2.0 corpus. Each training sample was created by concatenating two audio clips, inserting non-speech segments between them and at the end, and then adding noise (babble and pink noise) from the NOISEX-92 dataset.
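
The data construction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the gap and tail durations, the use of digital silence for the non-speech segments, and the definition of SNR relative to the RMS of the speech portions are all assumptions, and the function names are hypothetical. Audio is represented as plain lists of float samples at 16 kHz.

```python
import math

SAMPLE_RATE = 16_000  # Hz, matching the model's expected input rate


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples)) if samples else 0.0


def concat_with_gaps(clip_a, clip_b, gap_sec, tail_sec):
    """Join two clips with a non-speech gap between them and at the end.

    Assumption: the gaps are digital silence; the noise is added afterwards.
    """
    gap = [0.0] * round(gap_sec * SAMPLE_RATE)
    tail = [0.0] * round(tail_sec * SAMPLE_RATE)
    return clip_a + gap + clip_b + tail


def add_noise(signal, noise, snr_db, speech_rms):
    """Mix noise into signal so that speech_rms / noise_rms matches snr_db.

    Assumption: SNR is measured against the RMS of the speech portions only.
    The noise is looped if it is shorter than the signal.
    """
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise must not be silent")
    # SNR(dB) = 20 * log10(speech_rms / target_noise_rms)
    target_noise_rms = speech_rms / (10 ** (snr_db / 20))
    scale = target_noise_rms / noise_rms
    return [s + scale * noise[i % len(noise)] for i, s in enumerate(signal)]
```

In practice `noise` would be a babble or pink-noise recording from NOISEX-92 resampled to 16 kHz, and the transcript of the resulting sample would interleave the two clips' text with [無音] or [雑音] labels for the inserted segments.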

VADレス日本語音声認識モデル

このモデルは、rinna/japanese-hubert-baseをベースモデルとし、雑音環境下でのリアルタイム自動音声認識（ASR）のためにファインチューニングされた日本語音声認識モデルです。

このモデルの主な特徴は、個別の音声活動検出（VAD）処理を必要としない「VADレス」アーキテクチャです。音声入力に含まれる非発話区間（雑音や無音）を、[雑音]や[無音]といった特別なトークンとして明示的に認識し、出力することができます。これにより、事前のVAD処理を省略し、リアルタイム性の高い音声認識システムの構築を目指しています。

モデルのファインチューニングには、ReazonSpeech v2.0コーパスのmediumセット（約1,000時間）が使用されました。学習データは、2つの音声クリップを結合し、その間と末尾に非発話区間を挿入した上で、NOISEX-92データセットのノイズ（バブルノイズとピンクノイズ）を重畳する手法で作成されています。

Model Details

Base Model: rinna/japanese-hubert-base
Fine-tuning Strategy: Connectionist Temporal Classification (CTC)
Framework: Transformers
Sampling Rate: 16,000 Hz
Output Vocabulary: 3,200 unigram subword tokens generated with SentencePiece, including 3 special tokens ([雑音], [無音], [PAD]).
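
To illustrate how a CTC-trained model turns per-frame predictions into text over a vocabulary like the one above, here is a minimal greedy decoding sketch. The toy vocabulary and token IDs are invented for the example, and treating [PAD] as the CTC blank is an assumption (it is the usual convention for CTC tokenizers in Transformers, but it is not stated in this card).

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id):
    """Collapse consecutive repeats, then drop blanks (standard CTC greedy rule)."""
    tokens = []
    prev = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank_id:
            tokens.append(id_to_token[idx])
        prev = idx
    return "".join(tokens)


# Toy vocabulary (hypothetical IDs); the real model has 3,200 entries.
TOY_VOCAB = {0: "[PAD]", 1: "[無音]", 2: "[雑音]", 3: "こん", 4: "にちは"}

if __name__ == "__main__":
    # Argmax token ID per audio frame, as a CTC head would produce.
    frames = [1, 1, 0, 3, 3, 0, 4, 4, 4, 0, 2, 2]
    print(ctc_greedy_decode(frames, TOY_VOCAB, blank_id=0))
```

Note that [雑音] and [無音] are decoded like any other token; only the blank is dropped, which is what lets non-speech segments survive into the output.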

Evaluation Results (CER, %)

Test set	Clean	50 dB	20 dB	15 dB	10 dB	5 dB	0 dB
JNAS	10.61	10.57	10.16	10.10	10.43	11.64	15.95
ReazonSpeech	12.54	12.51	12.69	12.83	13.43	14.81	19.87
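
For reference, character error rate (CER) is the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below shows that computation; whether the reported scores remove [雑音] and [無音] or apply other text normalization before scoring is not stated in this card, so the example does no normalization at all.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(cer("こんにちは", "こんばんは"))
```

For a whole test set, CER is normally computed from the total edit distance divided by the total reference length, rather than by averaging per-utterance values.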

Citation

@inproceedings{emoto2025development,
  title={雑音環境下でのリアルタイムVADレス音声認識モデルの構築と他モデルとの比較 (Development of a real-time VAD-less speech recognition model in noisy environments and comparison with other models)},
  author={Emoto Jotaro and Nishimura Ryota and Ohta Kengo and Kitaoka Norihide},
  booktitle={Proc. Spring Meet. Acoust. Soc. Jpn.},
  year={2025},
}

@inproceedings{emoto2024development,
  title={雑音・無音棄却型リアルタイムVADレス音声認識モデルの開発 (Development of Noise and Silence Rejection Real-time VAD-less Speech Recognition Model)},
  author={Emoto Jotaro and Nishimura Ryota and Ohta Kengo and Kitaoka Norihide},
  booktitle={Proc. Spring Meet. Acoust. Soc. Jpn.},
  year={2024}
}