Indic CSM TTS
This repository contains CSM-TTS models fine-tuned with Unsloth. You can find the fine-tuning notebook in the unsloth/csm-1b repository. The model was trained on the dataset mentioned below.
To control the generated speaker's voice during inference, pass a `speaker_id` value in the prompt; the format is sketched after the sample table below.
| Text | Synthesized Audio |
|---|---|
| They decided to take a short break from work and travel to the mountains. | *(audio sample)* |
| I think that movie had a very unexpected and thrilling ending. | *(audio sample)* |
| It's always a good idea to double-check your work before submitting it. | *(audio sample)* |
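The speaker ID is prepended to the input text in square brackets. A minimal sketch of the prompt format used by the inference snippet further down (the sentence is illustrative):

```python
# The speaker ID is wrapped in square brackets and prepended to the text.
speaker_id = 0
prompt = f"[{speaker_id}]They decided to take a short break from work."
# -> "[0]They decided to take a short break from work."
```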
First, install the necessary libraries:
```
!pip install unsloth
!pip install transformers==4.52.3
```
Then, run inference with the following code snippet:
```python
from unsloth import FastModel
from transformers import CsmForConditionalGeneration
import torch
from IPython.display import Audio

model, processor = FastModel.from_pretrained(
    model_name = "onecxi/csm-english-jenny",
    max_seq_length = 2048,  # Choose any for long context!
    dtype = None,           # Leave as None for auto-detection
    auto_model = CsmForConditionalGeneration,
    load_in_4bit = False,   # Select True for 4-bit quantization - reduces memory usage
)

text = "We just finished fine-tuning a text to speech model."
speaker_id = 0
inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")

audio_values = model.generate(
    **inputs,
    max_new_tokens=125,  # 125 tokens is ~10 seconds of audio; increase this for longer speech
    # Tune these sampling parameters to get the best results
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    output_audio=True,
)

audio = audio_values[0].to(torch.float32).cpu().numpy()
Audio(audio, rate=24000)  # CSM generates audio at a 24 kHz sample rate
```
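Outside a notebook, you can write the generated waveform to a file instead of playing it inline. A minimal sketch using the soundfile library (an extra dependency, installed separately with `pip install soundfile`); the 24 kHz sample rate matches the playback call above:

```python
import soundfile as sf

# Write the generated waveform to disk; CSM audio is sampled at 24 kHz.
sf.write("output.wav", audio, samplerate=24000)
```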
Base model: sesame/csm-1b