This repository contains CSM TTS models fine-tuned with Unsloth. The fine-tuning notebook can be found in the unsloth/csm-1b repository. This model was trained on the following datasets:
- Expresso dataset (https://huggingface.co/datasets/ylacombe/expresso): 4 speakers (2 female, 2 male)
- Jenny dataset (https://huggingface.co/datasets/reach-vb/jenny_tts_dataset): 1 female speaker
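
If you want to inspect the training data, here is a minimal sketch using the Hugging Face datasets library. The "train" split name is an assumption; check each dataset card for the actual splits and columns:

```python
from datasets import load_dataset

# Peek at the Expresso data used for fine-tuning.
# The "train" split name is an assumption; verify on the dataset card.
expresso = load_dataset("ylacombe/expresso", split="train")
print(expresso.column_names)
```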
To control the generated speaker's voice during inference, prefix the input text with the speaker_id in square brackets (e.g. [0]), as shown in the inference snippet below.
First, install the necessary libraries:

```python
!pip install unsloth
!pip install transformers==4.52.3
```
Then, run inference with the following code snippet:

```python
from unsloth import FastModel
from transformers import CsmForConditionalGeneration
import torch
from IPython.display import Audio
model, processor = FastModel.from_pretrained(
    model_name = "onecxi/csm-english-multi-speaker-v2",
    max_seq_length = 2048,  # choose any length for long context
    dtype = None,           # leave as None for auto-detection
    auto_model = CsmForConditionalGeneration,
    load_in_4bit = False,   # set True for 4-bit quantization to reduce memory usage
)
text = "We just finished fine-tuning a text to speech model."
speaker_id = 0
inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
audio_values = model.generate(
    **inputs,
    max_new_tokens=125,  # 125 tokens is roughly 10 seconds of audio; increase for longer speech
    # Tune these sampling parameters to get the best results:
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    output_audio=True,
)
audio = audio_values[0].to(torch.float32).cpu().numpy()
Audio(audio, rate=24000)
```
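
To compare the different voices or keep the outputs outside a notebook, here is a short sketch that loops over candidate speaker IDs and writes each result to a WAV file. It assumes the soundfile library is installed, and that the five training speakers map to IDs 0-4; that mapping is an assumption based on the dataset list above, not something stated in this card:

```python
import soundfile as sf  # pip install soundfile

text = "We just finished fine-tuning a text to speech model."

# IDs 0-4 are assumed from the five training speakers listed above;
# verify against the actual speaker mapping for this model.
for speaker_id in range(5):
    inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
    audio_values = model.generate(**inputs, max_new_tokens=125, output_audio=True)
    audio = audio_values[0].to(torch.float32).cpu().numpy()
    sf.write(f"speaker_{speaker_id}.wav", audio, samplerate=24000)
```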
Base model: sesame/csm-1b