CSM-1B Indonesian Fine-tuned TTS Model

  • Developed by: Ellbendls
  • License: apache-2.0
  • Finetuned from model: unsloth/csm-1b
  • Language: Indonesian (Bahasa Indonesia)
  • Dataset: octava/indonesian-voice-transcription-1.1.85

Model Description

This is a fine-tuned CSM (Conversational Speech Model) for Indonesian Text-to-Speech (TTS) generation. The model has been adapted from the original CSM-1B to generate natural-sounding Indonesian speech from text input.

Key Features:

  • Multi-speaker support (81 unique speakers from the training dataset)
  • High-quality 24kHz audio output
  • Fast inference with optimized architecture
  • Support for various Indonesian text inputs

This CSM model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Usage

Installation

pip install transformers torch soundfile ipython

Basic Usage

from transformers import CsmForConditionalGeneration, AutoProcessor
import torch
from IPython.display import Audio, display
import soundfile as sf

# Load model and processor
model = CsmForConditionalGeneration.from_pretrained("Ellbendls/csm-1b-indonesian-fine-tuned")
processor = AutoProcessor.from_pretrained("Ellbendls/csm-1b-indonesian-fine-tuned")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Generate audio
text = "Selamat pagi, nama saya adalah Budi. Bagaimana kabar Anda hari ini?"
speaker_id = 0  # Use speaker ID 0-80

inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True).to(device)
audio_values = model.generate(
    **inputs,
    max_new_tokens=125,  # ~10 seconds of audio
    output_audio=True
)

# Convert and save audio
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("indonesian_tts_output.wav", audio, 24000)
display(Audio(audio, rate=24000))
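The multi-speaker support works by prefixing the input text with a `[speaker_id]` tag, as in the example above. A minimal sketch of building prompts for several voices (the speaker IDs 0, 5, and 10 are arbitrary picks for illustration; each prompt would then go through the processor and `model.generate` exactly as shown):

```python
text = "Selamat pagi, nama saya adalah Budi."

def make_prompt(speaker_id: int, text: str) -> str:
    """Prefix text with the [speaker_id] tag the CSM processor expects."""
    if not 0 <= speaker_id <= 80:  # this fine-tune has 81 speakers (IDs 0-80)
        raise ValueError(f"speaker_id must be in 0-80, got {speaker_id}")
    return f"[{speaker_id}]{text}"

# One prompt per voice; pass each through processor(...) and model.generate(...)
prompts = [make_prompt(sid, text) for sid in (0, 5, 10)]
print(prompts[0])  # [0]Selamat pagi, nama saya adalah Budi.
```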

Advanced Usage with Quality Control

# Generate with custom parameters for better quality
audio_values = model.generate(
    **inputs,
    max_new_tokens=200,  # For longer text
    depth_decoder_top_p=0.9,
    depth_decoder_do_sample=True,
    depth_decoder_temperature=0.9,
    output_audio=True
)

Model Details

Training Configuration

  • Base Model: unsloth/csm-1b
  • Training Steps: 100
  • Batch Size: 1 (with gradient accumulation steps: 8)
  • Learning Rate: 1e-4
  • Optimizer: AdamW 8-bit
  • Scheduler: Cosine
  • LoRA Configuration: r=32, alpha=32

Dataset Information

  • Dataset: octava/indonesian-voice-transcription-1.1.85
  • Language: Indonesian
  • Speakers: 81 unique speakers
  • Audio Duration: 0.55 - 13.8 seconds per sample
  • Text Length: 5 - 147 characters per sample
  • Sample Rate: 24kHz

Parameters

  • Speaker IDs: Use integer values from 0-80 to select different voice characteristics
  • Max Tokens: 125 tokens ≈ 10 seconds of audio (adjust for longer speech)
  • Audio Format: 24kHz WAV output
  • Device: Supports both CPU and GPU inference
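The 125 tokens ≈ 10 seconds rule of thumb above works out to roughly 12.5 tokens per second of audio. A small helper for choosing max_new_tokens from a target duration (the rate is an approximation taken from this model card, not an exact API contract):

```python
TOKENS_PER_SECOND = 12.5  # from the 125 tokens ~= 10 s rule of thumb; approximate

def max_tokens_for(seconds: float) -> int:
    """Estimate a max_new_tokens value for a desired audio duration."""
    return max(1, round(seconds * TOKENS_PER_SECOND))

print(max_tokens_for(10))  # 125, reproducing the basic-usage example
print(max_tokens_for(14))  # 175, near the dataset's 13.8 s maximum clip length
```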

Limitations

  • Optimized for Indonesian language text
  • Speaker consistency depends on the selected speaker_id
  • Audio quality may vary with very long text inputs
  • Best performance with modern Indonesian text (not archaic forms)
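Since training samples were at most 147 characters, long inputs can likely be handled more reliably by splitting the text into sentence-sized chunks, generating each separately, and concatenating the resulting waveforms. A sketch of the splitting step (the character limit mirrors the dataset's maximum; the per-chunk generation loop is just the basic-usage code applied to each chunk):

```python
import re

MAX_CHARS = 147  # longest training sample; keep each chunk at or below this

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text on sentence boundaries, packing sentences into
    chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Generate each chunk separately, then join the audio arrays,
# e.g. np.concatenate(audio_chunks), and save at 24 kHz as before.
```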

Citation

@misc{csm-1b-indonesian-fine-tuned,
  title={CSM-1B Indonesian Fine-tuned TTS Model},
  author={Ellbendls},
  year={2025},
  note={Fine-tuned from unsloth/csm-1b on octava/indonesian-voice-transcription-1.1.85}
}

Acknowledgments

  • Thanks to Unsloth for the efficient training framework
  • Original CSM-1B model developers
  • octava for providing the Indonesian voice transcription dataset