Whisper Large V3 Japanese Phone Accent

This Whisper model transcribes Japanese speech into Katakana with pitch accent annotations. It is built on whisper-large-v3-turbo and was fine-tuned on a 1/20 subset of the Galgame-Speech dataset and on the JSUT-5000 dataset.
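
As a rough illustration of how such a model might be called, the sketch below wraps the Hugging Face `transformers` automatic speech recognition pipeline. The model ID is the one shown on this page; the helper name `transcribe` and the audio path are hypothetical. It assumes `transformers`, a PyTorch backend, and ffmpeg (for decoding audio files) are installed. The exact output notation for pitch accent is not specified here, so no expected output is shown.

```python
# Model ID as it appears on this model page (spelling kept as-is).
MODEL_ID = "AkitoP/whisper-large-v3-japense-phone_accent"


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file and return the model's text output.

    Hypothetical helper for illustration; not part of the model's own code.
    """
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    # Builds a standard ASR pipeline; the weights are downloaded on first use.
    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)

    # The pipeline returns a dict whose "text" field holds the transcription.
    result = asr(audio_path)
    return result["text"]
```

Calling `transcribe("sample.wav")` (a placeholder path) should return the Katakana sequence with whatever accent markers the model was trained to emit.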

Training Data:

  • Stage 1: Audio from the Galgame-Speech dataset was used. The text was converted into Katakana sequences with pitch accent annotations using pyopenjtalk.
  • Stage 2: The JSUT-5000 dataset was used, with its original training set and pitch accent annotations. This data was split 90% for training and 10% for evaluation.
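
The 90/10 split in Stage 2 can be sketched as a seeded shuffle followed by a cut. This is a minimal illustration, not the procedure actually used: the function name, the seed, the placeholder utterance IDs, and the assumption of 5000 utterances are all hypothetical.

```python
import random


def split_train_eval(items, eval_fraction=0.1, seed=0):
    """Shuffle items deterministically and return (train, eval) lists."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    # The first n_eval shuffled items form the evaluation set, the rest training.
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder IDs; assumes 5000 utterances purely for illustration.
utterance_ids = [f"utt_{i:04d}" for i in range(5000)]
train_ids, eval_ids = split_train_eval(utterance_ids)
```

Fixing the seed means the same utterances land in the evaluation set on every run, which keeps CER figures comparable across training runs.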

Evaluation Results:

  • The model achieved a CER (Character Error Rate) of approximately 4% on the JSUT-5000 test set, which is an improvement over the 7% CER of pyopenjtalk.
  • Training with Stage 1 alone resulted in a CER of 13%; errors included specific misreadings and confusion between on'yomi (音読み) and kun'yomi (訓読み) readings. Stage 2 reduced these errors.
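
For reference, CER is commonly computed as the character-level edit distance between reference and hypothesis, divided by the number of reference characters. The sketch below shows one such corpus-level computation. It is an assumption about the metric's general form: how the figures above were actually computed (for example, whether accent markers count as characters) is not stated. The function names and sample strings are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


# One substituted character out of five reference characters.
score = cer(["コンニチワ"], ["コンニチハ"])  # 0.2
```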

We are currently seeking Japanese pitch accent annotated datasets. If you have such data, please reach out!

Model size: 809M params (Safetensors, tensor type F32)