# CSM English Multi Speaker V1

This repository contains a CSM-TTS model fine-tuned with Unsloth. You can find the fine-tuning notebook in the unsloth/csm-1b repository. The model was trained on the dataset listed below.
## Datasets

- [Expresso](https://huggingface.co/datasets/ylacombe/expresso): 4 speakers (2 female and 2 male)
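For reference, the dataset can be inspected with the 🤗 `datasets` library. A minimal sketch (the `"train"` split name is an assumption; check the dataset card):

```python
from datasets import load_dataset

# Load the Expresso dataset used for fine-tuning.
# The "train" split is an assumption, not confirmed by this repository.
expresso = load_dataset("ylacombe/expresso", split="train")
print(expresso)     # features and row count
print(expresso[0])  # one example row
```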
## Models

- csm-english-multi-speaker-v1: a multi-speaker model trained on the Expresso dataset.
## Speakers

To control the generated speaker's voice during inference, use the following `speaker_id` values:

- 0: Expresso male speaker 1
- 1: Expresso female speaker 1
- 2: Expresso male speaker 2
- 3: Expresso female speaker 2
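Concretely, the speaker ID is prepended to the text prompt in square brackets, in the same format the inference snippet below uses:

```python
# The processor expects prompts of the form "[<speaker_id>]<text>".
text = "It's always a good idea to double-check your work."
prompts = [f"[{speaker_id}]{text}" for speaker_id in range(4)]
print(prompts[1])  # "[1]It's always a good idea to double-check your work."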
## Sample Examples

| Speaker | Text | Synthesized Audio |
|---|---|---|
| 0 | They decided to take a short break from work and travel to the mountains. | *(audio sample)* |
| 0 | I think that movie had a very unexpected and thrilling ending. | *(audio sample)* |
| 0 | It's always a good idea to double-check your work before submitting it. | *(audio sample)* |
| 1 | They decided to take a short break from work and travel to the mountains. | *(audio sample)* |
| 1 | I think that movie had a very unexpected and thrilling ending. | *(audio sample)* |
| 1 | It's always a good idea to double-check your work before submitting it. | *(audio sample)* |
| 2 | They decided to take a short break from work and travel to the mountains. | *(audio sample)* |
| 2 | I think that movie had a very unexpected and thrilling ending. | *(audio sample)* |
| 2 | It's always a good idea to double-check your work before submitting it. | *(audio sample)* |
| 3 | They decided to take a short break from work and travel to the mountains. | *(audio sample)* |
| 3 | I think that movie had a very unexpected and thrilling ending. | *(audio sample)* |
| 3 | It's always a good idea to double-check your work before submitting it. | *(audio sample)* |
## Inference

First, install the necessary libraries:

```
!pip install unsloth
!pip install transformers==4.52.3
```
Then, run inference with the following code snippet:

```python
from unsloth import FastModel
from transformers import CsmForConditionalGeneration
import torch
from IPython.display import Audio

model, processor = FastModel.from_pretrained(
    model_name="onecxi/csm-english-multi-speaker-v1",
    max_seq_length=2048,  # choose any value for long context
    dtype=None,           # leave as None for auto-detection
    auto_model=CsmForConditionalGeneration,
    load_in_4bit=False,   # set True for 4-bit quantization to reduce memory usage
)

text = "We just finished fine-tuning a text to speech model."
speaker_id = 0
inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")

audio_values = model.generate(
    **inputs,
    max_new_tokens=125,  # 125 tokens is ~10 seconds of audio; increase for longer speech
    # Tune these sampling parameters to get the best results:
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    output_audio=True,
)

audio = audio_values[0].to(torch.float32).cpu().numpy()
Audio(audio, rate=24000)  # CSM generates 24 kHz audio
```
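Since 125 new tokens correspond to roughly 10 seconds of audio (about 12.5 tokens per second), you can size `max_new_tokens` from a target duration. To keep the result outside a notebook, a library such as `soundfile` can write the waveform to disk (the `soundfile` dependency and output filename below are assumptions, not part of this repository):

```python
import soundfile as sf

# ~12.5 audio tokens per second, per the comment above (125 tokens ≈ 10 s).
target_seconds = 20
max_new_tokens = int(12.5 * target_seconds)  # 250 tokens for ~20 s of speech

# Write the generated waveform to disk at CSM's 24 kHz sample rate.
sf.write("sample_speaker0.wav", audio, samplerate=24000)
```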
## Disclaimer
This Text-to-Speech (TTS) model is intended solely for research and educational use. Any use of the model must comply with all applicable laws, regulations, and ethical standards. The unauthorized use of this model for impersonating real individuals without their explicit consent is strictly prohibited. Additionally, the model must not be used to create or distribute deceptive, misleading, or fraudulent content, including but not limited to fake news or scams. Any use of the model for illegal, harmful, or malicious purposes is expressly forbidden.
By using this model, you acknowledge and agree to these terms. The creators and distributors of the model disclaim any liability for misuse and do not support or condone unethical or unlawful applications.
## Base Model

- [sesame/csm-1b](https://huggingface.co/sesame/csm-1b)