Upload fine-tuned Arabic CSM model on Common Voice 17
- README.md +97 -0
- config.json +7 -0
- model.safetensors +3 -0
- model_info.json +7 -0
README.md
CHANGED
@@ -1,3 +1,100 @@
---
language:
- ar
license: apache-2.0
base_model: sesame/csm-1b
tags:
- speech-synthesis
- text-to-speech
- arabic
- conversational-speech
- csm
- sesame
datasets:
- mozilla-foundation/common_voice_17_0
pipeline_tag: text-to-speech
---

# Sesame CSM Fine-Tuned on Common Voice 17 Arabic

This model is a fine-tuned version of [sesame/csm-1b](https://huggingface.co/sesame/csm-1b) on the Arabic subset of the Common Voice 17.0 dataset.

## Model Description

Sesame's Conversational Speech Model (CSM) is a state-of-the-art text-to-speech model that generates natural, conversational speech. This version has been fine-tuned specifically for Arabic speech synthesis. Performance is lower than expected because the Common Voice 17 Arabic data is noisy; more thorough pre-processing of the dataset should yield better results.

## Training Details

### Training Data

- **Dataset**: Mozilla Common Voice 17.0 (Arabic subset)
- **Language**: Arabic (ar)
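For reference, the sketch below shows one way to pull the Arabic subset from the Hub with the `datasets` library. It is only an illustration: the split, resampling rate, and any filtering actually used for this fine-tune are not documented here and should be treated as assumptions.

```python
from datasets import Audio, load_dataset

# Arabic ("ar") subset of Common Voice 17.0. The dataset is gated, so you must
# accept its terms on the Hugging Face Hub and be logged in. Depending on your
# `datasets` version, the loading script may also require trust_remote_code=True.
cv_ar = load_dataset("mozilla-foundation/common_voice_17_0", "ar", split="train")

# Resample the clips to 24 kHz, the rate CSM's audio codec operates at
# (the exact rate used for this fine-tune is an assumption).
cv_ar = cv_ar.cast_column("audio", Audio(sampling_rate=24_000))

example = cv_ar[0]
print(example["sentence"], example["audio"]["array"].shape)
```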
### Training Hyperparameters

After 15 sweep runs with different hyperparameters, the best-performing configuration was:

- **Batch Size**: 24
- **Learning Rate**: 3e-6
- **Epochs**: 25
- **Optimizer**: AdamW with exponential LR decay
- **Weight Decay**: 0.014182
- **Max Gradient Norm**: 2.923641
- **Warmup Steps**: 569
- **Gradient Accumulation Steps**: 1
- **Decoder Loss Weight**: 0.5
- **Mixed Precision**: Enabled (AMP)

### Training Configuration
```yaml
batch_size: 24
decoder_loss_weight: 0.5
device: "cuda"
gen_every: 2000
gen_speaker: 999
grad_acc_steps: 1
learning_rate: 0.000003
log_every: 10
lr_decay: "exponential"
max_grad_norm: 2.923641
n_epochs: 25
partial_data_loading: false
save_every: 2000
train_from_scratch: false
use_amp: true
val_every: 200
warmup_steps: 569
weight_decay: 0.014182
```
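To make these settings concrete, here is a rough sketch of how they might be wired together in PyTorch: AdamW, a linear warmup followed by exponential decay, gradient clipping, and AMP. This is not the actual training script for this model; in particular the decay factor (`gamma`), the warmup shape, and the `compute_loss` helper are assumptions.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ExponentialLR, LambdaLR, SequentialLR

def build_optimizer(model: torch.nn.Module, warmup_steps: int = 569):
    optimizer = AdamW(model.parameters(), lr=3e-6, weight_decay=0.014182)
    # Linear warmup over the first `warmup_steps` optimizer steps ...
    warmup = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    # ... then exponential decay (gamma here is illustrative, not documented above).
    decay = ExponentialLR(optimizer, gamma=0.9995)
    scheduler = SequentialLR(optimizer, [warmup, decay], milestones=[warmup_steps])
    return optimizer, scheduler

scaler = torch.cuda.amp.GradScaler()  # mixed precision (use_amp: true)

def training_step(model, batch, optimizer, scheduler, compute_loss):
    # `compute_loss` is a placeholder for the CSM backbone/decoder loss,
    # combined using decoder_loss_weight: 0.5.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.923641)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```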
### Generation Sample

The model was tested with the following Arabic text during training:

> "في صباحٍ مشرق، تجمّع الأطفال في الساحة يلعبون ويضحكون تحت أشعة الشمس، بينما كانت الطيور تغرّد فوق الأشجار. الأمل يملأ القلوب، والحياة تمضي بخطى هادئة نحو غدٍ أجمل."
>
> ("On a bright morning, the children gathered in the courtyard, playing and laughing under the rays of the sun, while the birds sang above the trees. Hope fills the hearts, and life moves at a calm pace toward a more beautiful tomorrow.")

## Model Architecture

- **Backbone**: LLaMA-1B based architecture
- **Decoder**: LLaMA-100M based decoder
- **Audio Codebooks**: 32
- **Audio Vocabulary Size**: 2,051
- **Text Vocabulary Size**: 128,256

## Usage

Use the following repo to run the model with a Gradio interface: https://github.com/Saganaki22/CSM-WebUI. You need at least 8 GB of VRAM to run the model.
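Beyond the WebUI, one possible programmatic route is to download the fine-tuned weights from the Hub and load the safetensors state dict directly. The repository ID below is a placeholder (this card does not state the final repo name), and turning the weights into a runnable CSM model still requires the model code from the upstream Sesame CSM repository:

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Placeholder repo ID; substitute the actual Hub repository of this model.
ckpt_path = hf_hub_download(
    repo_id="your-username/csm-1b-arabic-commonvoice17",
    filename="model.safetensors",
)

# The checkpoint is a plain safetensors state dict (~3.1 GB on disk).
state_dict = load_file(ckpt_path)
print(f"Loaded {len(state_dict)} tensors")

# The tensors can then be loaded into a CSM model built from config.json
# (llama-1B backbone, llama-100M decoder, 32 codebooks), e.g. using the
# SesameAILabs/csm model code, or served through the CSM-WebUI linked above.
```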
## Limitations and Bias

- This model is trained specifically for Arabic speech synthesis
- Performance may vary across Arabic dialects
- The model inherits any biases present in the Common Voice 17.0 Arabic dataset

## Acknowledgments

- Original CSM model by the Sesame team
- Mozilla Foundation for the Common Voice dataset
- Hugging Face for the model hosting platform
- Modal Labs for the compute
config.json
ADDED
@@ -0,0 +1,7 @@
{
    "audio_num_codebooks": 32,
    "audio_vocab_size": 2051,
    "backbone_flavor": "llama-1B",
    "decoder_flavor": "llama-100M",
    "text_vocab_size": 128256
}
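As a small illustration, these architecture fields can be read and validated before constructing the model; the dataclass below is purely illustrative and not part of any published API:

```python
import json
from dataclasses import dataclass

@dataclass
class CSMArchitecture:  # illustrative container, not an official class
    audio_num_codebooks: int
    audio_vocab_size: int
    backbone_flavor: str
    decoder_flavor: str
    text_vocab_size: int

with open("config.json") as f:
    arch = CSMArchitecture(**json.load(f))

print(arch.backbone_flavor, arch.decoder_flavor)  # llama-1B llama-100M
```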
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bd39c0f8725de9756ca9bc5015a60bf0b467bbc86270df8ebac84d79c2b6abfe
size 3113994456
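This file is a Git LFS pointer; the actual weights are fetched by `git lfs pull` or the Hub client. If you want to check a downloaded copy against the `oid` above, a minimal sketch follows (the local path is assumed):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file in 1 MiB chunks to avoid loading ~3 GB into memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert sha256_of("model.safetensors") == (
    "bd39c0f8725de9756ca9bc5015a60bf0b467bbc86270df8ebac84d79c2b6abfe"
)
```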
model_info.json
ADDED
@@ -0,0 +1,7 @@
{
    "model_type": "csm",
    "language": "ar",
    "base_model": "sesame/csm-1b",
    "dataset": "mozilla-foundation/common_voice_17_0",
    "fine_tuned_from": "sesame/csm-1b"
}