Tsukasa 司 Speech: Engineering the Naturalness and Rich Expressiveness

tl;dr: I made a very cool Japanese speech generation model.

If the demo didn't work and you just want to listen to some samples, take a look at this notebook. (P.S. The samples come from a much earlier checkpoint and are not representative of the model at its best.)

Try chatting with Aira, a mini-project I built using various tech, including Tsukasa. (It may not be very optimized, but hey, it works!)

The Japanese model card is here.

Part of a personal project focused on further advancing the Japanese speech field.

Use the HuggingFace Space for Tsukasa (24 kHz):
~~HuggingFace Space for Tsumugi (48 kHz):~~
Join Shoukan lab's Discord server, a comfy place I frequently visit ->

GitHub repo:

What is this?

Note: This model only supports the Japanese language; ~~but you can feed it Romaji if you use the Gradio demo.~~ (No longer the case, due to resource constraints, but the tech is there.)

This is a speech generation network aimed at maximizing the expressiveness and controllability of the generated speech. At its core it uses StyleTTS 2's architecture, with the following changes:

mLSTM layers instead of regular PyTorch LSTM layers, plus higher-capacity text and prosody encoders (more parameters)
PL-BERT, Pitch Extractor, and Text Aligner retrained from scratch
Whisper's encoder instead of WavLM for the SLM
48 kHz config
Improved performance on non-verbal sounds and cues such as sighs and pauses, and very slightly on laughter (depends on the speaker)
A new way of sampling the style vectors
Promptable speech synthesis
A smart phonemization algorithm that can handle Romaji inputs or a mixture of Japanese and Romaji
Fixed DDP and BF16 training (mostly!)
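
To give a feel for what handling "a mixture of Japanese and Romaji" involves, here is a minimal sketch of one plausible first step: splitting the input into runs by script so each run can be routed to the right phonemizer. This is not the algorithm used in this repository; the function names (`is_japanese_char`, `split_by_script`) and the three labels are hypothetical, and the Unicode ranges only cover hiragana, katakana, and the common kanji block.

```python
def is_japanese_char(ch: str) -> bool:
    """Return True for hiragana, katakana, or common kanji (assumed ranges)."""
    code = ord(ch)
    return (
        0x3040 <= code <= 0x309F      # hiragana block
        or 0x30A0 <= code <= 0x30FF   # katakana block
        or 0x4E00 <= code <= 0x9FFF   # CJK unified ideographs (kanji)
    )


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters into (label, run) pairs.

    Labels are "japanese", "latin" (ASCII letters, i.e. candidate Romaji),
    and "other" (spaces, digits, punctuation, anything else).
    """
    runs: list[tuple[str, str]] = []
    for ch in text:
        if is_japanese_char(ch):
            label = "japanese"
        elif ch.isascii() and ch.isalpha():
            label = "latin"
        else:
            label = "other"
        if runs and runs[-1][0] == label:
            # Same script as the previous character: extend the current run.
            runs[-1] = (label, runs[-1][1] + ch)
        else:
            runs.append((label, ch))
    return runs


for label, run in split_by_script("こんにちは sekai"):
    print(label, repr(run))
```

A real phonemizer would still have to convert the "latin" runs to kana or phonemes and resolve kanji readings from context; the sketch only shows the routing step.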

There are two checkpoints you can use: Tsukasa and Tsumugi 48 kHz (placeholder).

Tsukasa was trained on ~800 hours of studio-grade, high-quality data, sourced mainly from games and novels, with part of it from a private dataset. So the Japanese is going to be "anime Japanese" (it differs from how people usually speak in real life).

Brought to you by:

Special thanks to Yinghao Aaron Li, the author of StyleTTS, on which this work is based.
He is one of the most talented engineers I've ever seen in this field. Thanks also to Karesto and Raven (a.k.a. hexgrad) for their help in debugging some of the scripts. Wonderful people.

Why does it matter?

Recently, there has been a big trend toward larger models and ever-increasing scale. We're going the opposite way, trying to see how far we can push the limits using existing tools. Maybe, just maybe, scale is not necessarily the answer.

There are also a few questions specific to Japanese (which can have a wider impact on languages that face similar issues, such as Arabic): how can we improve intonation for this language, and what can be done to accurately annotate text that can have various spellings depending on the context?

How to do ...

Pre-requisites

Python >= 3.11
Clone this repository:

git clone https://huggingface.co/Respair/Tsukasa_Speech
cd Tsukasa_Speech

Install python requirements:

pip install -r requirements.txt
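
Since the only stated interpreter requirement is Python >= 3.11, a quick check before installing can save a failed setup. The snippet below is a small sketch; the helper name `python_is_supported` is made up for illustration and is not part of the repository.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the pre-requisites


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


current = ".".join(str(part) for part in sys.version_info[:3])
if python_is_supported():
    print(f"Python {current} meets the >= 3.11 requirement.")
else:
    print(f"Python {current} is older than 3.11; use a newer interpreter.")
```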

Inference:

Gradio demo:

python app_tsuka.py

Or check the inference notebook. Before that, make sure you read the Important Notes section below.

Training:

Before starting, remove lines 985 and 986 from models.py, and also remove "KotoDama_Prompt, KotoDama_Text" from the parameters of the "build_model" function.

First stage training:

accelerate launch train_first.py --config_path ./Configs/config.yml

Second stage training:

accelerate launch accelerate_train_second.py --config_path ./Configs/config.yml

SLM joint training doesn't work on multi-GPU. (You don't need it; I didn't use it either.)

or:

launch train_first.py --config_path ./Configs/config.yml
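
If you prefer to drive the two `accelerate launch` commands above from a script, a minimal sketch could look like this. The script names and the default config path are taken from the commands above; the helper `build_train_command` and the `STAGE_SCRIPTS` mapping are hypothetical and not part of the repository.

```python
import shlex

# Stage name -> training script, as given in the commands above.
STAGE_SCRIPTS = {
    "first": "train_first.py",
    "second": "accelerate_train_second.py",
}


def build_train_command(stage: str, config_path: str = "./Configs/config.yml") -> list[str]:
    """Return the argv list for one training stage (illustrative helper)."""
    if stage not in STAGE_SCRIPTS:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(STAGE_SCRIPTS)}")
    return ["accelerate", "launch", STAGE_SCRIPTS[stage], "--config_path", config_path]


for stage in ("first", "second"):
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_train_command(stage)))
```

To actually launch a stage, pass the returned list to `subprocess.run(cmd, check=True)` from the repository root, after applying the models.py changes described above.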

Third stage training (Kotodama, prompt encoding, etc.):

Not planned right now due to some constraints, but feel free to replicate.

Some ideas for the future

I can think of a few things that could be improved, not necessarily by me; treat these as suggestions:

[o] changing the decoder (FreGrad looks promising)
[o] retraining the Pitch Extractor using a different algorithm
[o] while the quality of non-speech sounds has improved, the model cannot generate an entirely non-speech output, perhaps because of the hard alignment.
[o] using the style encoder as another modality in LLMs, since it provides a detailed representation of the tone and expression of speech (similar to Style-Talker).

Training details

8x A40s + 2x V100s (32 GB each)
750 ~ 800 hours of data
Bfloat16
Approximately 3 weeks of training; 3 months overall, including the work spent on the data pipeline.
Roughly 66.6 kg CO2eq of carbon emitted, if we base it on Google Cloud. (I didn't use Google, but the cluster is located in the US; please treat it as a very rough approximation.)
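
To put these numbers in perspective, the arithmetic below converts them into GPU-hours and an implied emission rate. It is a back-of-the-envelope sketch that assumes all 10 GPUs ran for the full three weeks, which the details above do not state; it does not reproduce how the 66.6 kg figure was originally computed.

```python
A40_COUNT = 8
V100_COUNT = 2
TRAINING_WEEKS = 3        # "approximately 3 weeks"
REPORTED_CO2_KG = 66.6    # rough figure quoted above


def gpu_hours(gpu_count: int, weeks: float) -> float:
    """GPU-hours if every GPU runs around the clock for the given weeks."""
    return gpu_count * weeks * 7 * 24


# Assumption: all GPUs were busy for the whole period (an upper bound).
total_gpu_hours = gpu_hours(A40_COUNT + V100_COUNT, TRAINING_WEEKS)
grams_per_gpu_hour = REPORTED_CO2_KG * 1000 / total_gpu_hours

print(f"Upper-bound GPU-hours: {total_gpu_hours:.0f}")
print(f"Implied emissions: {grams_per_gpu_hour:.1f} g CO2eq per GPU-hour")
```

Under that assumption the run comes to about 5,040 GPU-hours, or roughly 13 g CO2eq per GPU-hour.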

Important Notes

Check here

Any questions?

[email protected]

Or simply DM me on Discord.

Some cool projects:

Kokoro - a very nice and lightweight TTS based on StyleTTS. Supports Japanese and English.
VoPho - a meta phonemizer to rule them all. It automatically handles any language with hand-picked, high-quality phonemizers.

References

yl4579/StyleTTS2
NX-AI/xlstm
archinetai/audio-diffusion-pytorch
jik876/hifi-gan
rishikksh20/iSTFTNet-pytorch
nii-yamagishilab/project-NN-Pytorch-scripts/project/01-nsf
litain's Moe Speech, a very cool dataset you can use in case I can't release mine

@article{xlstm,
  title={xLSTM: Extended Long Short-Term Memory},
  author={Beck, Maximilian and P{\"o}ppel, Korbinian and Spanring, Markus and Auer, Andreas and Prudnikova, Oleksandra and Kopp, Michael and Klambauer, G{\"u}nter and Brandstetter, Johannes and Hochreiter, Sepp},
  journal={arXiv preprint arXiv:2405.04517},
  year={2024}
}
