Upload fine-tuned Arabic CSM model on Common Voice 17
- README.md +97 -0
- config.json +7 -0
- model.safetensors +3 -0
- model_info.json +7 -0
README.md
CHANGED
@@ -1,3 +1,100 @@
---
language:
- ar
license: apache-2.0
base_model: sesame/csm-1b
tags:
- speech-synthesis
- text-to-speech
- arabic
- conversational-speech
- csm
- sesame
datasets:
- mozilla-foundation/common_voice_17_0
pipeline_tag: text-to-speech
---

# Sesame CSM Fine-Tuned on Common Voice 17 Arabic

This model is a fine-tuned version of [sesame/csm-1b](https://huggingface.co/sesame/csm-1b) on the Arabic subset of the Common Voice 17.0 dataset.

## Model Description

Sesame's Conversational Speech Model (CSM) is a state-of-the-art text-to-speech model that generates natural, conversational speech. This version has been fine-tuned specifically for Arabic speech synthesis. Performance is lower than expected because the Common Voice 17 Arabic data is noisy; more thorough pre-processing of the dataset should yield better results.

## Training Details

### Training Data

- **Dataset**: Mozilla Common Voice 17.0 (Arabic subset)
- **Language**: Arabic (ar)
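For reference, the sketch below shows one way to pull the Arabic subset from the Hub with the `datasets` library. It is only an illustration: the split, resampling rate, and any filtering actually used for this fine-tune are not documented here and should be treated as assumptions.

```python
from datasets import Audio, load_dataset

# Arabic ("ar") subset of Common Voice 17.0. The dataset is gated, so you must
# accept its terms on the Hugging Face Hub and be logged in. Depending on your
# `datasets` version, the loading script may also require trust_remote_code=True.
cv_ar = load_dataset("mozilla-foundation/common_voice_17_0", "ar", split="train")

# Resample the clips to 24 kHz, the rate CSM's audio codec operates at
# (the exact rate used for this fine-tune is an assumption).
cv_ar = cv_ar.cast_column("audio", Audio(sampling_rate=24_000))

example = cv_ar[0]
print(example["sentence"], example["audio"]["array"].shape)
```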
### Training Hyperparameters

After 15 sweep runs with different hyperparameters, the best-performing configuration was:

- **Batch Size**: 24
- **Learning Rate**: 3e-6
- **Epochs**: 25
- **Optimizer**: AdamW with exponential LR decay
- **Weight Decay**: 0.014182
- **Max Gradient Norm**: 2.923641
- **Warmup Steps**: 569
- **Gradient Accumulation Steps**: 1
- **Decoder Loss Weight**: 0.5
- **Mixed Precision**: Enabled (AMP)

### Training Configuration
```yaml
batch_size: 24
decoder_loss_weight: 0.5
device: "cuda"
gen_every: 2000
gen_speaker: 999
grad_acc_steps: 1
learning_rate: 0.000003
log_every: 10
lr_decay: "exponential"
max_grad_norm: 2.923641
n_epochs: 25
partial_data_loading: false
save_every: 2000
train_from_scratch: false
use_amp: true
val_every: 200
warmup_steps: 569
weight_decay: 0.014182
```
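To make these settings concrete, here is a rough sketch of how they might be wired together in PyTorch: AdamW, a linear warmup followed by exponential decay, gradient clipping, and AMP. This is not the actual training script for this model; in particular the decay factor (`gamma`), the warmup shape, and the `compute_loss` helper are assumptions.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ExponentialLR, LambdaLR, SequentialLR

def build_optimizer(model: torch.nn.Module, warmup_steps: int = 569):
    optimizer = AdamW(model.parameters(), lr=3e-6, weight_decay=0.014182)
    # Linear warmup over the first `warmup_steps` optimizer steps ...
    warmup = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    # ... then exponential decay (gamma here is illustrative, not documented above).
    decay = ExponentialLR(optimizer, gamma=0.9995)
    scheduler = SequentialLR(optimizer, [warmup, decay], milestones=[warmup_steps])
    return optimizer, scheduler

scaler = torch.cuda.amp.GradScaler()  # mixed precision (use_amp: true)

def training_step(model, batch, optimizer, scheduler, compute_loss):
    # `compute_loss` is a placeholder for the CSM backbone/decoder loss,
    # combined using decoder_loss_weight: 0.5.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.923641)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```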
### Generation Sample

The model was tested with the following Arabic text during training:

> "في صباحٍ مشرق، تجمّع الأطفال في الساحة يلعبون ويضحكون تحت أشعة الشمس، بينما كانت الطيور تغرّد فوق الأشجار. الأمل يملأ القلوب، والحياة تمضي بخطى هادئة نحو غدٍ أجمل."
>
> ("On a bright morning, the children gathered in the courtyard, playing and laughing under the rays of the sun, while the birds sang above the trees. Hope fills the hearts, and life moves at a calm pace toward a more beautiful tomorrow.")

## Model Architecture

- **Backbone**: LLaMA-1B based architecture
- **Decoder**: LLaMA-100M based decoder
- **Audio Codebooks**: 32
- **Audio Vocabulary Size**: 2,051
- **Text Vocabulary Size**: 128,256

## Usage

Use the following repo to run the model with a Gradio interface: https://github.com/Saganaki22/CSM-WebUI. You need at least 8 GB of VRAM to run the model.
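Beyond the WebUI, one possible programmatic route is to download the fine-tuned weights from the Hub and load the safetensors state dict directly. The repository ID below is a placeholder (this card does not state the final repo name), and turning the weights into a runnable CSM model still requires the model code from the upstream Sesame CSM repository:

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Placeholder repo ID; substitute the actual Hub repository of this model.
ckpt_path = hf_hub_download(
    repo_id="your-username/csm-1b-arabic-commonvoice17",
    filename="model.safetensors",
)

# The checkpoint is a plain safetensors state dict (~3.1 GB on disk).
state_dict = load_file(ckpt_path)
print(f"Loaded {len(state_dict)} tensors")

# The tensors can then be loaded into a CSM model built from config.json
# (llama-1B backbone, llama-100M decoder, 32 codebooks), e.g. using the
# SesameAILabs/csm model code, or served through the CSM-WebUI linked above.
```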
## Limitations and Bias

- This model is trained specifically for Arabic speech synthesis
- Performance may vary across Arabic dialects
- The model inherits any biases present in the Common Voice 17.0 Arabic dataset

## Acknowledgments

- Original CSM model by the Sesame team
- Mozilla Foundation for the Common Voice dataset
- Hugging Face for the model hosting platform
- Modal Labs for the compute
config.json
ADDED
@@ -0,0 +1,7 @@
{
    "audio_num_codebooks": 32,
    "audio_vocab_size": 2051,
    "backbone_flavor": "llama-1B",
    "decoder_flavor": "llama-100M",
    "text_vocab_size": 128256
}
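As a small illustration, these architecture fields can be read and validated before constructing the model; the dataclass below is purely illustrative and not part of any published API:

```python
import json
from dataclasses import dataclass

@dataclass
class CSMArchitecture:  # illustrative container, not an official class
    audio_num_codebooks: int
    audio_vocab_size: int
    backbone_flavor: str
    decoder_flavor: str
    text_vocab_size: int

with open("config.json") as f:
    arch = CSMArchitecture(**json.load(f))

print(arch.backbone_flavor, arch.decoder_flavor)  # llama-1B llama-100M
```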
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bd39c0f8725de9756ca9bc5015a60bf0b467bbc86270df8ebac84d79c2b6abfe
size 3113994456
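This file is a Git LFS pointer; the actual weights are fetched by `git lfs pull` or the Hub client. If you want to check a downloaded copy against the `oid` above, a minimal sketch follows (the local path is assumed):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file in 1 MiB chunks to avoid loading ~3 GB into memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert sha256_of("model.safetensors") == (
    "bd39c0f8725de9756ca9bc5015a60bf0b467bbc86270df8ebac84d79c2b6abfe"
)
```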
model_info.json
ADDED
@@ -0,0 +1,7 @@
{
    "model_type": "csm",
    "language": "ar",
    "base_model": "sesame/csm-1b",
    "dataset": "mozilla-foundation/common_voice_17_0",
    "fine_tuned_from": "sesame/csm-1b"
}