This is a reformat of the official Fish Speech v1.5 weights to work with fish-speech.rs.

I've made the following changes, for better compatibility with Candle.rs and the HuggingFace ecosystem:

DualAR transformer weights converted to .safetensors for safety and easier loading
Tokenizer ported from Tiktoken format and custom wrapper to HuggingFace Tokenizers for easier downstream use
VQGAN is unchanged from v1.4, so I copied the weight-norm-merged safetensors and FireflyGAN config from my previous conversion
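
To illustrate the "safety and easier loading" point: a .safetensors file starts with an 8-byte little-endian header length, followed by a plain JSON header that describes every tensor, followed by the raw tensor bytes. It can therefore be inspected without unpickling anything. The sketch below uses only the Python standard library and is not part of fish-speech.rs. The function names and the `demo.weight` tensor are made up for illustration, and the demo writes its own tiny file so it runs without the real weights.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize_tensors(header):
    """Map tensor name -> (dtype, shape), skipping the optional metadata entry."""
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"
    }


def write_demo_file(path):
    """Write a tiny valid .safetensors file (hypothetical tensor, for the demo only)."""
    data = struct.pack("<4f", 0.0, 1.0, 2.0, 3.0)  # four float32 values
    header = {
        "demo.weight": {
            "dtype": "F32",
            "shape": [2, 2],
            "data_offsets": [0, len(data)],
        }
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)) + header_bytes + data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.safetensors")
        write_demo_file(demo_path)
        for name, (dtype, shape) in summarize_tensors(
            read_safetensors_header(demo_path)
        ).items():
            print(name, dtype, shape)
```

Pointing `read_safetensors_header` at a weight file from this repository should list its tensor names, dtypes, and shapes in the same way; the exact file names here are not assumed by the sketch.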

NOTE:

Please respect the original license and do not use this model for commercial purposes. You can support Fish Audio by using the official API at fish.audio.
These weights WILL NOT work with the official Fish Speech inference code!

ORIGINAL README: Fish Speech V1.5

Fish Speech V1.5 is a leading text-to-speech (TTS) model trained on more than 1 million hours of audio data in multiple languages.

Supported languages:

English (en) >300k hours
Chinese (zh) >300k hours
Japanese (ja) >100k hours
German (de) ~20k hours
French (fr) ~20k hours
Spanish (es) ~20k hours
Korean (ko) ~20k hours
Arabic (ar) ~20k hours
Russian (ru) ~20k hours
Dutch (nl) <10k hours
Italian (it) <10k hours
Polish (pl) <10k hours
Portuguese (pt) <10k hours

Please refer to Fish Speech Github for more info.
Demo available at Fish Audio.

Citation

If you found this repository useful, please consider citing this work:

@misc{fish-speech-v1.4,
      title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis}, 
      author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
      year={2024},
      eprint={2411.01156},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2411.01156}, 
}

License

This model is licensed under the CC-BY-NC-SA-4.0 license.