UtterTune

UtterTune is a low-rank adaptation (LoRA) adapter that enables segmental pronunciation and prosody control on top of text-to-speech (TTS) systems built on large language model (LLM) architectures, without any grapheme-to-phoneme module.

This repo supports Japanese on CosyVoice 2 and provides LoRA weights only (no full model weights). The training data is derived from the JSUT and JVS corpora.

How to use

See the UtterTune GitHub repository.

Static demo

https://shuheikatoinfo.github.io/UtterTune/

Input sentences for the sample files

# cv2_base.wav (CosyVoice 2)
ι­‘ι­…ι­ι­ŽγŒθ·‹ζ‰ˆγ™γ‚‹γ€‚

# cv2_base_kana.wav (CosyVoice 2)
γƒγƒŸγƒ’γƒΌγƒͺγƒ§γƒΌγŒγƒγƒƒγ‚³γ™γ‚‹γ€‚

# cv2_uttertune.wav (CosyVoice 2 + UtterTune)
<PHON_START>チ'γƒŸ/γƒ’γƒΌγƒͺョー<PHON_END>が<PHON_START>バ'ッコ<PHON_END>する。

Citation

If you use UtterTune in your research, please cite the paper:

@misc{Kato2025UtterTune,
  title={UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech},
  author={Kato, Shuhei},
  year={2025},
  howpublished={arXiv:2508.09767 [cs.CL]},
}

Model tree for shuheikatoinfo/UtterTune-CosyVoice2-ja-JSUTJVS
