Indic CSM TTS
This repository contains CSM-TTS models fine-tuned with Unsloth. You can find the fine-tuning notebook in the unsloth/csm-1b repository. The model was trained on the dataset mentioned below.
To control the generated speaker's voice during inference, pass a `speaker_id` value in the prompt; the format is sketched after the sample table below.
| Text | Synthesized Audio |
|---|---|
| They decided to take a short break from work and travel to the mountains. | *(audio sample)* |
| I think that movie had a very unexpected and thrilling ending. | *(audio sample)* |
| It's always a good idea to double-check your work before submitting it. | *(audio sample)* |
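The speaker ID is prepended to the input text in square brackets. A minimal sketch of the prompt format used by the inference snippet further down (the sentence is illustrative):

```python
# The speaker ID is wrapped in square brackets and prepended to the text.
speaker_id = 0
prompt = f"[{speaker_id}]They decided to take a short break from work."
# -> "[0]They decided to take a short break from work."
```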
First, install the necessary libraries:
```
!pip install unsloth
!pip install transformers==4.52.3
```
Then, run inference with the following code snippet:
```python
from unsloth import FastModel
from transformers import CsmForConditionalGeneration
import torch
from IPython.display import Audio

model, processor = FastModel.from_pretrained(
    model_name = "onecxi/csm-english-jenny",
    max_seq_length = 2048,  # Choose any for long context!
    dtype = None,           # Leave as None for auto-detection
    auto_model = CsmForConditionalGeneration,
    load_in_4bit = False,   # Select True for 4-bit quantization - reduces memory usage
)

text = "We just finished fine-tuning a text to speech model."
speaker_id = 0
inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")

audio_values = model.generate(
    **inputs,
    max_new_tokens=125,  # 125 tokens is ~10 seconds of audio; increase this for longer speech
    # Tune these sampling parameters to get the best results
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    output_audio=True,
)

audio = audio_values[0].to(torch.float32).cpu().numpy()
Audio(audio, rate=24000)  # CSM generates audio at a 24 kHz sample rate
```
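Outside a notebook, you can write the generated waveform to a file instead of playing it inline. A minimal sketch using the soundfile library (an extra dependency, installed separately with `pip install soundfile`); the 24 kHz sample rate matches the playback call above:

```python
import soundfile as sf

# Write the generated waveform to disk; CSM audio is sampled at 24 kHz.
sf.write("output.wav", audio, samplerate=24000)
```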
Base model: sesame/csm-1b