UtterTune

UtterTune is a low-rank adaptation (LoRA) adapter that enables segmental pronunciation and prosody control on top of text-to-speech (TTS) systems built on large language model (LLM) architectures, without any grapheme-to-phoneme module.

This repo supports Japanese on CosyVoice 2 and provides LoRA weights only (no full model weights). The training data is derived from the JSUT and JVS corpora.

How to use

See the UtterTune GitHub repository.

Static demo

https://shuheikatoinfo.github.io/UtterTune/

Input sentences for the sample files

# cv2_base.wav (CosyVoice 2)
ι­‘ι­…ι­ι­ŽγŒθ·‹ζ‰ˆγ™γ‚‹γ€‚

# cv2_base_kana.wav (CosyVoice 2)
γƒγƒŸγƒ’γƒΌγƒͺγƒ§γƒΌγŒγƒγƒƒγ‚³γ™γ‚‹γ€‚

# cv2_uttertune.wav (CosyVoice 2 + UtterTune)
<PHON_START>チ'γƒŸ/γƒ’γƒΌγƒͺョー<PHON_END>が<PHON_START>バ'ッコ<PHON_END>する。

Citation

If you use UtterTune in your research, please cite the paper:

@misc{Kato2025UtterTune,
  title={UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech},
  author={Kato, Shuhei},
  year={2025},
  howpublished={arXiv:2508.09767 [cs.CL]},
}

Model tree for shuheikatoinfo/UtterTune-CosyVoice2-ja-JSUTJVS
