---
library_name: transformers
datasets:
- reazon-research/reazonspeech
language:
- ja
metrics:
- cer
base_model:
- rinna/japanese-hubert-base
pipeline_tag: automatic-speech-recognition
---

# VAD-less Japanese ASR Model
This is a Japanese speech recognition model fine-tuned from `rinna/japanese-hubert-base` for real-time automatic speech recognition (ASR) in noisy environments.  
Its main feature is a "VAD-less" architecture that does not require a separate Voice Activity Detection (VAD) step.  
It explicitly recognizes non-speech segments (noise or silence) in the audio input and outputs them as the special tokens `[雑音]` (noise) and `[無音]` (silence).  
The goal is to make speech recognition systems more responsive in real time by omitting the VAD preprocessing step.

The model was fine-tuned on the **medium** set (approx. 1,000 hours) of the ReazonSpeech v2.0 corpus.
Each training example was created by concatenating two audio clips, inserting non-speech segments between them and at the end, and then adding noise (babble and pink noise) from the NOISEX-92 dataset.
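
The data construction described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' actual pipeline: the function names, the gap and tail durations, the SNR value, and the choice to measure signal power over the whole concatenated waveform are all assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 16_000  # the model expects 16 kHz audio


def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so the signal-to-noise ratio equals snr_db."""
    noise = np.resize(noise, signal.shape)  # repeat or trim the noise to fit
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise


def build_example(clip_a, clip_b, noise, gap_sec=1.0, tail_sec=1.0, snr_db=10.0):
    """Concatenate two clips with non-speech gaps, then add noise.

    gap_sec, tail_sec and snr_db are hypothetical values for illustration.
    """
    gap = np.zeros(int(gap_sec * SAMPLE_RATE), dtype=clip_a.dtype)
    tail = np.zeros(int(tail_sec * SAMPLE_RATE), dtype=clip_a.dtype)
    clean = np.concatenate([clip_a, gap, clip_b, tail])
    return mix_at_snr(clean, noise, snr_db)
```

Here `noise` would be a babble or pink noise recording; in this sketch the inserted gaps are digital silence before mixing, so after mixing they contain only noise.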

# VADレス日本語音声認識モデル
このモデルは、rinna/japanese-hubert-baseをベースモデルとし、雑音環境下でのリアルタイム自動音声認識（ASR）のためにファインチューニングされた日本語音声認識モデルです。

このモデルの主な特徴は、個別の音声活動検出（VAD）処理を必要としない「VADレス」アーキテクチャです。
音声入力に含まれる非発話区間（雑音や無音）を、[雑音]や[無音]といった特別なトークンとして明示的に認識し、出力することができます。
これにより、事前のVAD処理を省略し、リアルタイム性の高い音声認識システムの構築を目指しています。

モデルのファインチューニングには、ReazonSpeech v2.0コーパスのmediumセット（約1,000時間）が使用されました。
学習データは、2つの音声クリップを結合し、その間と末尾に非発話区間を挿入した上で、NOISEX-92データセットのノイズ（バブルノイズとピンクノイズ）を重畳する手法で作成されています。

## Model Details
- **Base Model:** [rinna/japanese-hubert-base](https://huggingface.co/rinna/japanese-hubert-base)
- **Fine-tuning Strategy:** Connectionist Temporal Classification (CTC) 
- **Framework:** Transformers
- **Sampling Rate:** 16,000 Hz 
- **Output Vocabulary:** 3,200 unigram tokens generated with SentencePiece, including 3 special tokens (`[雑音]`, `[無音]`, `[PAD]`).
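
To illustrate how CTC output with these special tokens might be post-processed, here is a minimal greedy-decoding sketch. It assumes that `[PAD]` serves as the CTC blank (common for CTC models in Transformers, but not stated in this card) and that the per-frame argmax tokens are already available as strings. The function name and the example token pieces are hypothetical.

```python
from itertools import groupby

NON_SPEECH_TOKENS = {"[雑音]", "[無音]"}
BLANK_TOKEN = "[PAD]"  # assumption: [PAD] acts as the CTC blank


def ctc_greedy_decode(frame_tokens, keep_non_speech=False):
    """Collapse repeated frame labels, drop blanks, optionally drop non-speech tokens."""
    collapsed = [token for token, _ in groupby(frame_tokens)]
    tokens = [t for t in collapsed if t != BLANK_TOKEN]
    if not keep_non_speech:
        tokens = [t for t in tokens if t not in NON_SPEECH_TOKENS]
    return "".join(tokens)


# Hypothetical per-frame predictions for a short utterance.
frames = ["[無音]", "[無音]", "[PAD]", "こん", "こん", "[PAD]", "にちは", "[雑音]", "[雑音]", "[PAD]"]
print(ctc_greedy_decode(frames))                        # こんにちは
print(ctc_greedy_decode(frames, keep_non_speech=True))  # [無音]こんにちは[雑音]
```

Keeping the non-speech tokens can be useful when a downstream component needs to know where silence or noise occurred; dropping them yields plain text.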

## Evaluation Results (CER %)
| Test set     | Clean | 50 dB | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  |
|:---          | ---:  | ---:  | ---:  | ---:  | ---:  | ---:  | ---:  |
| JNAS         | 10.61 | 10.57 | 10.16 | 10.10 | 10.43 | 11.64 | 15.95 |
| ReazonSpeech | 12.54 | 12.51 | 12.69 | 12.83 | 13.43 | 14.81 | 19.87 |
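
For reference, character error rate is conventionally the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The sketch below implements that standard definition. This card does not describe the exact scoring script or any text normalization used for the table, so treat this as an illustration of the metric only; the function names and example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_percent(references, hypotheses):
    """Corpus-level CER in percent: total edits / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total


# One deleted character out of seven reference characters.
print(round(cer_percent(["今日は晴れです"], ["今日は晴です"]), 2))  # 14.29
```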


## Citation
```
@inproceedings{emoto2025development,
  title={雑音環境下でのリアルタイムVADレス音声認識モデルの構築と他モデルとの比較 (Development of a real-time VAD-less speech recognition model in noisy environments and comparison with other models)},
  author={Emoto Jotaro and Nishimura Ryota and Ohta Kengo and Kitaoka Norihide},
  booktitle={Proc. Spring Meet. Acoust. Soc. Jpn.},
  year={2025},
}

@inproceedings{emoto2024development,
  title={雑音・無音棄却型リアルタイムVADレス音声認識モデルの開発 (Development of Noise and Silence Rejection Real-time VAD-less Speech Recognition Model)},
  author={Emoto Jotaro and Nishimura Ryota and Ohta Kengo and Kitaoka Norihide},
  booktitle={Proc. Spring Meet. Acoust. Soc. Jpn.},
  year={2024}
}
```