CSM English Multi Speaker V1

This repository contains a CSM text-to-speech model fine-tuned with Unsloth from the sesame/csm-1b base model. The fine-tuning notebook is available in the unsloth/csm-1b repository. The model was trained on the dataset listed below.

Datasets

  1. Expresso dataset (available at https://huggingface.co/datasets/ylacombe/expresso): 4 speakers (2 female and 2 male)
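
For quick inspection, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch; the split name and field names are assumptions based on the dataset card:

from datasets import load_dataset

# Load the Expresso dataset from the Hugging Face Hub.
# The "train" split and the printed fields are assumptions; check the dataset card.
expresso = load_dataset("ylacombe/expresso", split="train")
print(expresso)     # inspect the available columns (audio, text, speaker metadata)
print(expresso[0])  # look at a single example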

Models

  • csm-english-multi-speaker-v1: a multi-speaker model trained on the Expresso dataset.

Speakers

To control the voice of the generated speech during inference, use one of the following speaker_id values:

  • 0: expresso male speaker 1
  • 1: expresso female speaker 1
  • 2: expresso male speaker 2
  • 3: expresso female speaker 2
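
During inference, the speaker is selected by prefixing the text with the speaker ID in square brackets, the same format used in the Inference section below. A small hypothetical helper that mirrors the mapping above:

# Hypothetical convenience mapping; the IDs follow the list above.
SPEAKERS = {
    0: "expresso male speaker 1",
    1: "expresso female speaker 1",
    2: "expresso male speaker 2",
    3: "expresso female speaker 2",
}

def format_prompt(speaker_id: int, text: str) -> str:
    # CSM expects the speaker ID in square brackets before the text.
    return f"[{speaker_id}]{text}"

print(format_prompt(2, "Hello there!"))  # -> [2]Hello there!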

Sample Examples

| Speaker | Text | Synthesized Audio |
|---------|------|-------------------|
| 0 | They decided to take a short break from work and travel to the mountains. | (audio) |
| 0 | I think that movie had a very unexpected and thrilling ending. | (audio) |
| 0 | It's always a good idea to double-check your work before submitting it. | (audio) |
| 1 | They decided to take a short break from work and travel to the mountains. | (audio) |
| 1 | I think that movie had a very unexpected and thrilling ending. | (audio) |
| 1 | It's always a good idea to double-check your work before submitting it. | (audio) |
| 2 | They decided to take a short break from work and travel to the mountains. | (audio) |
| 2 | I think that movie had a very unexpected and thrilling ending. | (audio) |
| 2 | It's always a good idea to double-check your work before submitting it. | (audio) |
| 3 | They decided to take a short break from work and travel to the mountains. | (audio) |
| 3 | I think that movie had a very unexpected and thrilling ending. | (audio) |
| 3 | It's always a good idea to double-check your work before submitting it. | (audio) |

Inference

First, install the necessary libraries:

!pip install unsloth
!pip install transformers==4.52.3

Then, run inference with the following code snippet:

from unsloth import FastModel
from transformers import CsmForConditionalGeneration
import torch
from IPython.display import Audio

model, processor = FastModel.from_pretrained(
    model_name = "onecxi/csm-english-multi-speaker-v1",
    max_seq_length = 2048,  # choose any value for long context
    dtype = None,           # leave as None for auto-detection
    auto_model = CsmForConditionalGeneration,
    load_in_4bit = False,   # set True for 4-bit loading to reduce memory usage
)

text = "We just finished fine-tuning a text to speech model."
speaker_id = 0

inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
audio_values = model.generate(
    **inputs,
    max_new_tokens=125,  # 125 tokens is 10 seconds of audio; increase for longer speech
    # Sampling parameters: play with these to get the best results.
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    output_audio=True,
)
audio = audio_values[0].to(torch.float32).cpu().numpy()

# Play the generated audio (CSM outputs audio at 24 kHz).
Audio(audio, rate=24000)
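
To batch-generate clips, for example to reproduce the Sample Examples table above, you can loop over speakers and sentences and write each result to disk. A sketch, assuming the model and processor from the snippet above plus the soundfile library (pip install soundfile); the file names are arbitrary:

import soundfile as sf

sentences = [
    "They decided to take a short break from work and travel to the mountains.",
    "I think that movie had a very unexpected and thrilling ending.",
    "It's always a good idea to double-check your work before submitting it.",
]

for speaker_id in range(4):  # speakers 0-3, see the Speakers section
    for i, text in enumerate(sentences):
        inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
        audio_values = model.generate(**inputs, max_new_tokens=250, output_audio=True)
        audio = audio_values[0].to(torch.float32).cpu().numpy()
        sf.write(f"speaker{speaker_id}_sample{i}.wav", audio, 24000)  # CSM outputs 24 kHz audio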

Disclaimer

This Text-to-Speech (TTS) model is intended solely for research and educational use. Any use of the model must comply with all applicable laws, regulations, and ethical standards. The unauthorized use of this model for impersonating real individuals without their explicit consent is strictly prohibited. Additionally, the model must not be used to create or distribute deceptive, misleading, or fraudulent content, including but not limited to fake news or scams. Any use of the model for illegal, harmful, or malicious purposes is expressly forbidden.

By using this model, you acknowledge and agree to these terms. The creators and distributors of the model disclaim any liability for misuse and do not support or condone unethical or unlawful applications.
