CSM English Multi Speaker V1
This repository contains fine-tuned models of CSM-TTS with unsloth. You can find the fine-tuning notebook in unsloth/csm-1b
repository. This model is trained using dataset which mentioned below.
Datasets
- Expresso dataset (available at https://huggingface.co/datasets/ylacombe/expresso): 4 speakers (2 female and 2 male)
Models
- csm-english-multi-speaker-v1: This is a multi-speaker model trained on Expresso dataset.
Speakers
To control the generated speaker's voice during inference, use the following speaker_id
values:
- 0: expresso male speaker 1
- 1: expresso female speaker 1
- 2: expresso male speaker 2
- 3: expresso female speaker 2
Sample Examples
Speaker | Text | Synthesized Audio |
---|---|---|
0 | They decided to take a short break from work and travel to the mountains. | |
0 | I think that movie had a very unexpected and thrilling ending. | |
0 | It's always a good idea to double-check your work before submitting it. | |
1 | They decided to take a short break from work and travel to the mountains. | |
1 | I think that movie had a very unexpected and thrilling ending. | |
1 | It's always a good idea to double-check your work before submitting it. | |
2 | They decided to take a short break from work and travel to the mountains. | |
2 | I think that movie had a very unexpected and thrilling ending. | |
2 | It's always a good idea to double-check your work before submitting it. | |
3 | They decided to take a short break from work and travel to the mountains. | |
3 | I think that movie had a very unexpected and thrilling ending. | |
3 | It's always a good idea to double-check your work before submitting it. |
Inference
First, install the necessary libraries:
!pip install unsloth
!pip install transformers==4.52.3
Then, run inference with the following code-snippet:
from unsloth import FastModel
from transformers import CsmForConditionalGeneration
import torch
from IPython.display import Audio
model, processor = FastModel.from_pretrained(
model_name = "onecxi/csm-english-multi-speaker-v1",
max_seq_length= 2048, # Choose any for long context!
dtype = None, # Leave as None for auto-detection
auto_model = CsmForConditionalGeneration,
load_in_4bit = False, # Select True for 4bit - reduces memory usage
)
text = "We just finished fine-tuning a text to speech model."
speaker_id = 0
inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True, return_tensors="pt").to("cuda")
audio_values = model.generate(
**inputs,
max_new_tokens=125, # 125 tokens is 10 seconds of audio, for longer speech increase this
# play with these parameters to get the best results
depth_decoder_temperature=0.6,
depth_decoder_top_k=0,
depth_decoder_top_p=0.9,
temperature=0.8,
top_k=50,
top_p=1.0,
#########################################################
output_audio=True
)
audio = audio_values[0].to(torch.float32).cpu().numpy()
Audio(audio, rate=24000)
Disclaimer
This Text-to-Speech (TTS) model is intended solely for research and educational use. Any use of the model must comply with all applicable laws, regulations, and ethical standards. The unauthorized use of this model for impersonating real individuals without their explicit consent is strictly prohibited. Additionally, the model must not be used to create or distribute deceptive, misleading, or fraudulent content, including but not limited to fake news or scams. Any use of the model for illegal, harmful, or malicious purposes is expressly forbidden.
By using this model, you acknowledge and agree to these terms. The creators and distributors of the model disclaim any liability for misuse and do not support or condone unethical or unlawful applications.
- Downloads last month
- 35
Model tree for onecxi/csm-english-multi-speaker-v1
Base model
sesame/csm-1b