CSM English Multi Speaker V2

This repository contains a CSM-TTS model (based on sesame/csm-1b) fine-tuned with Unsloth. You can find the fine-tuning notebook in the unsloth/csm-1b repository. The model is trained on the datasets listed below.

Datasets

  1. Expresso dataset (available at https://huggingface.co/datasets/ylacombe/expresso): 4 speakers (2 female and 2 male)

  2. Jenny dataset (available at https://huggingface.co/datasets/reach-vb/jenny_tts_dataset): 1 female speaker
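
Both datasets are hosted on the Hugging Face Hub; if you want to inspect the training data yourself, they can be loaded with the datasets library (a minimal sketch; the split names are assumptions):

from datasets import load_dataset

# Pull the two training corpora from the Hugging Face Hub
expresso = load_dataset("ylacombe/expresso", split="train")
jenny = load_dataset("reach-vb/jenny_tts_dataset", split="train")

# Inspect the available columns (audio, text, speaker labels, ...)
print(expresso)
print(jenny)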

Models

  • csm-english-multi-speaker-v2: A multi-speaker model trained on the combined Expresso and Jenny datasets.

Speakers

To control the generated voice at inference time, pass one of the following speaker_id values (a looping example follows the list):

  • 0: expresso male speaker 1
  • 1: expresso female speaker 1
  • 2: expresso male speaker 2
  • 3: expresso female speaker 2
  • 5: jenny female speaker
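
For example, once the model and processor are loaded (see the Inference section below), you can render the same sentence in every available voice by looping over these IDs (a minimal sketch, assuming the generation settings from the snippet below):

# Assumes `model` and `processor` are loaded as shown in the Inference section
text = "Hello, this is a voice check."
for speaker_id in [0, 1, 2, 3, 5]:
    inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
    audio_values = model.generate(**inputs, max_new_tokens=125, output_audio=True)
    # audio_values[0] holds the 24 kHz waveform for this speaker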

Inference

First, install the necessary libraries:

!pip install unsloth
!pip install transformers==4.52.3

Then run inference with the following code snippet:

from unsloth import FastModel
from transformers import CsmForConditionalGeneration
import torch
from IPython.display import Audio

model, processor = FastModel.from_pretrained(
    model_name = "onecxi/csm-english-multi-speaker-v2",
    max_seq_length = 2048, # Maximum sequence length; increase for longer contexts
    dtype = None, # None auto-detects the best dtype for your GPU
    auto_model = CsmForConditionalGeneration,
    load_in_4bit = False, # Set True for 4-bit quantization to reduce memory usage
)

text = "We just finished fine-tuning a text to speech model."
speaker_id = 0

# Prefix the text with the speaker ID in square brackets to select the voice
inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
audio_values = model.generate(
    **inputs,
    max_new_tokens=125, # 125 tokens ≈ 10 seconds of audio; increase for longer speech
    # Tune the sampling parameters below to get the best results
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    output_audio=True
)
audio = audio_values[0].to(torch.float32).cpu().numpy()

Audio(audio, rate=24000)
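
Outside a notebook, you can write the waveform to disk instead, for example with the soundfile package (an assumption; any WAV writer that accepts a float32 NumPy array works):

import soundfile as sf

# CSM generates audio at a 24 kHz sample rate
sf.write("output.wav", audio, samplerate=24000)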