UtterTune
UtterTune is a low-rank adaptation (LoRA) adapter that enables segmental pronunciation and prosody control on top of text-to-speech (TTS) models built on large language model (LLM) architectures without grapheme-to-phoneme modules.
This repository supports Japanese on CosyVoice 2 and provides the LoRA weights only (no full model weights). The training data is derived from the JSUT and JVS corpora.
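As background on the mechanism, a LoRA adapter leaves the base weights frozen and adds a trainable low-rank update: the adapted layer computes y = Wx + (alpha/r)·B(Ax), where only the small matrices A and B are trained. The following is a minimal numerical sketch of that arithmetic with toy values, not the UtterTune or CosyVoice 2 implementation:

```python
# Minimal LoRA arithmetic sketch (illustrative toy values, not UtterTune code).
# Adapted layer: y = W x + (alpha / r) * B (A x); W is frozen, (A, B) are trained.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight (identity here)
A = [[0.5, -0.5]]              # rank-1 down-projection (r=1, d=2)
B = [[2.0], [0.0]]             # rank-1 up-projection
alpha, r = 2.0, 1              # LoRA scaling hyperparameters

def lora_forward(x):
    base = matvec(W, x)                     # frozen path
    low = matvec(B, matvec(A, x))           # low-rank update path
    return [b + (alpha / r) * l for b, l in zip(base, low)]

print(lora_forward([1.0, 0.0]))  # → [3.0, 0.0]
```

Because the update has rank r far below the layer dimension, the adapter adds only a small number of trainable parameters, which is why UtterTune can be distributed as lightweight weights separate from the base model.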
How to use
See the UtterTune GitHub repository.
Static demo
https://shuheikatoinfo.github.io/UtterTune/
Input sentences for the sample files
# cv2_base.wav (CosyVoice 2)
魑魅魍魎が跋扈する。
# cv2_base_kana.wav (CosyVoice 2)
チミモーリョーがバッコする。
# cv2_uttertune.wav (CosyVoice 2 + UtterTune)
<PHON_START>チ'ミ/モーリョー<PHON_END>が<PHON_START>バ'ッコ<PHON_END>する。
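In the annotated sample, <PHON_START>…<PHON_END> delimit a span whose reading is spelled out in kana; within the span, ' appears to mark the pitch-accent nucleus and / an accent-phrase boundary. A small sketch of stripping this markup to recover the plain kana reading — plain_reading is a hypothetical helper for illustration, not part of the UtterTune API:

```python
import re

# Hypothetical helper illustrating the annotation format; not the UtterTune API.
PHON = re.compile(r"<PHON_START>(.*?)<PHON_END>")

def plain_reading(text):
    """Drop the PHON tags and the accent marks (' and /) inside them."""
    def strip_marks(m):
        return m.group(1).replace("'", "").replace("/", "")
    return PHON.sub(strip_marks, text)

annotated = "<PHON_START>チ'ミ/モーリョー<PHON_END>が<PHON_START>バ'ッコ<PHON_END>する。"
print(plain_reading(annotated))  # → チミモーリョーがバッコする。
```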
Citation
If you use UtterTune in your research, please cite the paper:
@misc{Kato2025UtterTune,
title={UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech},
author={Kato, Shuhei},
year={2025},
howpublished={arXiv:2508.09767 [cs.CL]},
}
Model tree for shuheikatoinfo/UtterTune-CosyVoice2-ja-JSUTJVS
Base model: FunAudioLLM/CosyVoice2-0.5B