CSM English Multi Speaker V2

This repository contains a CSM-TTS model (based on sesame/csm-1b) fine-tuned with Unsloth. You can find the fine-tuning notebook in the unsloth/csm-1b repository. The model is trained on the datasets listed below.

Datasets

  1. Expresso dataset (available at https://huggingface.co/datasets/ylacombe/expresso): 4 speakers (2 female and 2 male)

  2. Jenny dataset (available at https://huggingface.co/datasets/reach-vb/jenny_tts_dataset): 1 female speaker
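
Both datasets are hosted on the Hugging Face Hub; if you want to inspect the training data yourself, they can be loaded with the datasets library (a minimal sketch; the split names are assumptions):

from datasets import load_dataset

# Pull the two training corpora from the Hugging Face Hub
expresso = load_dataset("ylacombe/expresso", split="train")
jenny = load_dataset("reach-vb/jenny_tts_dataset", split="train")

# Inspect the available columns (audio, text, speaker labels, ...)
print(expresso)
print(jenny)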

Models

  • csm-english-multi-speaker-v2: A multi-speaker model trained on the combined Expresso and Jenny datasets.

Speakers

To control the generated voice at inference time, pass one of the following speaker_id values (a looping example follows the list):

  • 0: expresso male speaker 1
  • 1: expresso female speaker 1
  • 2: expresso male speaker 2
  • 3: expresso female speaker 2
  • 5: jenny female speaker
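
For example, once the model and processor are loaded (see the Inference section below), you can render the same sentence in every available voice by looping over these IDs (a minimal sketch, assuming the generation settings from the snippet below):

# Assumes `model` and `processor` are loaded as shown in the Inference section
text = "Hello, this is a voice check."
for speaker_id in [0, 1, 2, 3, 5]:
    inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
    audio_values = model.generate(**inputs, max_new_tokens=125, output_audio=True)
    # audio_values[0] holds the 24 kHz waveform for this speaker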

Inference

First, install the necessary libraries:

!pip install unsloth
!pip install transformers==4.52.3

Then run inference with the following code snippet:

from unsloth import FastModel
from transformers import CsmForConditionalGeneration
import torch
from IPython.display import Audio

model, processor = FastModel.from_pretrained(
    model_name = "onecxi/csm-english-multi-speaker-v2",
    max_seq_length = 2048, # Maximum sequence length; increase for longer contexts
    dtype = None, # None auto-detects the best dtype for your GPU
    auto_model = CsmForConditionalGeneration,
    load_in_4bit = False, # Set True for 4-bit quantization to reduce memory usage
)

text = "We just finished fine-tuning a text to speech model."
speaker_id = 0

# Prefix the text with the speaker ID in square brackets to select the voice
inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
audio_values = model.generate(
    **inputs,
    max_new_tokens=125, # 125 tokens ≈ 10 seconds of audio; increase for longer speech
    # Tune the sampling parameters below to get the best results
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    output_audio=True
)
audio = audio_values[0].to(torch.float32).cpu().numpy()

Audio(audio, rate=24000)
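
Outside a notebook, you can write the waveform to disk instead, for example with the soundfile package (an assumption; any WAV writer that accepts a float32 NumPy array works):

import soundfile as sf

# CSM generates audio at a 24 kHz sample rate
sf.write("output.wav", audio, samplerate=24000)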