Sesame CSM Fine-Tuned on Common Voice 17 Arabic

This model is a fine-tuned version of sesame/csm-1b on the Arabic subset of the Common Voice 17.0 dataset.

Model Description

Sesame's Conversational Speech Model (CSM) is a state-of-the-art text-to-speech model that generates natural, conversational speech. This version has been specifically fine-tuned for Arabic speech synthesis.

The model did learn the new language and shows some encouraging signs. However, performance was below average, which was expected given the noise in the Common Voice 17 dataset; additional pre-processing of the data would likely yield better results.

Training Details

Training Data

  • Dataset: Mozilla Common Voice 17.0 (Arabic subset)
  • Language: Arabic (ar)

Training Hyperparameters

After 15 sweep runs with different hyperparameters, the following configuration performed best:

  • Batch Size: 24
  • Learning Rate: 3e-6
  • Epochs: 25
  • Optimizer: AdamW with exponential LR decay
  • Weight Decay: 0.014182
  • Max Gradient Norm: 2.923641
  • Warmup Steps: 569
  • Gradient Accumulation Steps: 1
  • Decoder Loss Weight: 0.5
  • Mixed Precision: Enabled (AMP)
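The decoder loss weight above implies a weighted sum of the backbone and decoder objectives. The following is only a sketch of that weighting; the exact formulation lives in the CSM training code, not in this card:

```python
def combined_loss(backbone_loss, decoder_loss, decoder_loss_weight=0.5):
    # Hypothetical combination: the card only states decoder_loss_weight = 0.5.
    # The backbone term is assumed to be unweighted.
    return backbone_loss + decoder_loss_weight * decoder_loss
```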

Training Configuration

batch_size: 24
decoder_loss_weight: 0.5
device: "cuda"
gen_every: 2000
gen_speaker: 999
grad_acc_steps: 1
learning_rate: 0.000003
log_every: 10
lr_decay: "exponential"
max_grad_norm: 2.923641
n_epochs: 25
partial_data_loading: false
save_every: 2000
train_from_scratch: false
use_amp: true
val_every: 200
warmup_steps: 569
weight_decay: 0.014182
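The warmup and decay settings above suggest a schedule roughly like the sketch below: linear warmup over 569 steps to the peak learning rate of 3e-6, followed by exponential decay. The decay rate (`gamma`) is an assumption, as the card does not state it:

```python
def lr_at_step(step, peak_lr=3e-6, warmup_steps=569, gamma=0.9999):
    # Linear warmup from ~0 up to peak_lr over warmup_steps steps,
    # then exponential decay. gamma is illustrative, not from the card.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * gamma ** (step - warmup_steps)
```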

Generation Sample

The model was tested with the following Arabic text during training:

"في صباحٍ مشرق، تجمّع الأطفال في الساحة يلعبون ويضحكون تحت أشعة الشمس، بينما كانت الطيور تغرّد فوق الأشجار. الأمل يملأ القلوب، والحياة تمضي بخطى هادئة نحو غدٍ أجمل."

(English translation: "On a bright morning, the children gathered in the courtyard, playing and laughing under the rays of the sun, while the birds sang above the trees. Hope fills the hearts, and life moves at a calm pace toward a more beautiful tomorrow.")

Model Architecture

  • Backbone: LLaMA-1B based architecture
  • Decoder: LLaMA-100M based decoder
  • Audio Codebooks: 32
  • Audio Vocabulary Size: 2,051
  • Text Vocabulary Size: 128,256
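As a rough illustration of what these numbers mean for the audio token stream: each generated audio frame carries one index per codebook, each drawn from the per-codebook audio vocabulary. A minimal sketch (the frame count is hypothetical):

```python
import random

n_codebooks = 32     # audio codebooks, from the architecture table above
audio_vocab = 2051   # audio vocabulary size per codebook

# A hypothetical 10-frame clip: one codebook index per codebook per frame.
n_frames = 10
frames = [
    [random.randrange(audio_vocab) for _ in range(n_codebooks)]
    for _ in range(n_frames)
]
```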

Usage

Use the following repo to run the model with Gradio: https://github.com/Saganaki22/CSM-WebUI

You need at least 8 GB of VRAM to run the model.
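A small pre-flight check for the 8 GB VRAM requirement can be sketched as follows; the helper name is illustrative and not part of the WebUI repo. On a CUDA machine, `total_bytes` would come from `torch.cuda.get_device_properties(0).total_memory`:

```python
def has_enough_vram(total_bytes, required_gb=8):
    # The 8 GB figure comes from the usage note above.
    return total_bytes >= required_gb * 1024**3
```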

Limitations and Bias

  • This model is specifically trained for Arabic speech synthesis
  • Performance may vary with different Arabic dialects
  • The model inherits any biases present in the Common Voice 17.0 Arabic dataset

Acknowledgments

  • Original CSM model by Sesame team
  • Mozilla Foundation for the Common Voice dataset
  • HuggingFace for the model hosting platform
  • Modal labs for the compute