Sesame CSM Fine-Tuned on Common Voice 17 Arabic

This model is a fine-tuned version of sesame/csm-1b on the Arabic subset of the Common Voice 17.0 dataset.

Model Description

Sesame's Conversational Speech Model (CSM) is a state-of-the-art text-to-speech model that generates natural, conversational speech. This version has been specifically fine-tuned for Arabic speech synthesis.

The model did learn the new language and shows some encouraging signs. However, performance was below average, which was expected given the noise in the Common Voice 17 dataset; additional pre-processing of the data would likely yield better results.

Training Details

Training Data

  • Dataset: Mozilla Common Voice 17.0 (Arabic subset)
  • Language: Arabic (ar)

Training Hyperparameters

After 15 sweep runs with different hyperparameters, the following configuration performed best:

  • Batch Size: 24
  • Learning Rate: 3e-6
  • Epochs: 25
  • Optimizer: AdamW with exponential LR decay
  • Weight Decay: 0.014182
  • Max Gradient Norm: 2.923641
  • Warmup Steps: 569
  • Gradient Accumulation Steps: 1
  • Decoder Loss Weight: 0.5
  • Mixed Precision: Enabled (AMP)
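The decoder loss weight above implies a weighted sum of the backbone and decoder objectives. The following is only a sketch of that weighting; the exact formulation lives in the CSM training code, not in this card:

```python
def combined_loss(backbone_loss, decoder_loss, decoder_loss_weight=0.5):
    # Hypothetical combination: the card only states decoder_loss_weight = 0.5.
    # The backbone term is assumed to be unweighted.
    return backbone_loss + decoder_loss_weight * decoder_loss
```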

Training Configuration

batch_size: 24
decoder_loss_weight: 0.5
device: "cuda"
gen_every: 2000
gen_speaker: 999
grad_acc_steps: 1
learning_rate: 0.000003
log_every: 10
lr_decay: "exponential"
max_grad_norm: 2.923641
n_epochs: 25
partial_data_loading: false
save_every: 2000
train_from_scratch: false
use_amp: true
val_every: 200
warmup_steps: 569
weight_decay: 0.014182
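The warmup and decay settings above suggest a schedule roughly like the sketch below: linear warmup over 569 steps to the peak learning rate of 3e-6, followed by exponential decay. The decay rate (`gamma`) is an assumption, as the card does not state it:

```python
def lr_at_step(step, peak_lr=3e-6, warmup_steps=569, gamma=0.9999):
    # Linear warmup from ~0 up to peak_lr over warmup_steps steps,
    # then exponential decay. gamma is illustrative, not from the card.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * gamma ** (step - warmup_steps)
```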

Generation Sample

The model was tested with the following Arabic text during training:

"في صباحٍ مشرق، تجمّع الأطفال في الساحة يلعبون ويضحكون تحت أشعة الشمس، بينما كانت الطيور تغرّد فوق الأشجار. الأمل يملأ القلوب، والحياة تمضي بخطى هادئة نحو غدٍ أجمل."

(English translation: "On a bright morning, the children gathered in the courtyard, playing and laughing under the rays of the sun, while the birds sang above the trees. Hope fills the hearts, and life moves at a calm pace toward a more beautiful tomorrow.")

Model Architecture

  • Backbone: LLaMA-1B based architecture
  • Decoder: LLaMA-100M based decoder
  • Audio Codebooks: 32
  • Audio Vocabulary Size: 2,051
  • Text Vocabulary Size: 128,256
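As a rough illustration of what these numbers mean for the audio token stream: each generated audio frame carries one index per codebook, each drawn from the per-codebook audio vocabulary. A minimal sketch (the frame count is hypothetical):

```python
import random

n_codebooks = 32     # audio codebooks, from the architecture table above
audio_vocab = 2051   # audio vocabulary size per codebook

# A hypothetical 10-frame clip: one codebook index per codebook per frame.
n_frames = 10
frames = [
    [random.randrange(audio_vocab) for _ in range(n_codebooks)]
    for _ in range(n_frames)
]
```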

Usage

Use the following repo to run the model with Gradio: https://github.com/Saganaki22/CSM-WebUI

You need at least 8 GB of VRAM to run the model.
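A small pre-flight check for the 8 GB VRAM requirement can be sketched as follows; the helper name is illustrative and not part of the WebUI repo. On a CUDA machine, `total_bytes` would come from `torch.cuda.get_device_properties(0).total_memory`:

```python
def has_enough_vram(total_bytes, required_gb=8):
    # The 8 GB figure comes from the usage note above.
    return total_bytes >= required_gb * 1024**3
```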

Limitations and Bias

  • This model is specifically trained for Arabic speech synthesis
  • Performance may vary with different Arabic dialects
  • The model inherits any biases present in the Common Voice 17.0 Arabic dataset

Acknowledgments

  • Original CSM model by Sesame team
  • Mozilla Foundation for the Common Voice dataset
  • HuggingFace for the model hosting platform
  • Modal labs for the compute