Text-to-Speech
Safetensors
RedbeardNZ Hecheng0625 committed on
Commit 1edff88 · verified · 0 Parent(s)

Duplicate from amphion/MaskGCT

Co-authored-by: Yuancheng Wang <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
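
Reviewer note: the patterns above decide which files Git stores through LFS. The sketch below is one rough way to check a path against a few of them in Python. It is an approximation only: `fnmatch` is not Git's matcher, the helper name `is_lfs_tracked` is hypothetical, and only slash-free patterns such as `*.safetensors` are handled.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the slash-free patterns from the .gitattributes hunk above.
LFS_PATTERNS = ["*.safetensors", "*.bin", "*.ckpt", "*.pt", "*.tar.*", "*tfevents*"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: does the file's base name match any LFS pattern?

    Assumption: comparing only the base name is good enough for patterns
    without '/'. Git applies extra rules to patterns such as
    'saved_model/**/*', which this helper does not try to reproduce.
    """
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)
```

For an authoritative answer inside a clone, `git check-attr filter -- <path>` reports the filter Git itself resolves for a path.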
README.md ADDED
@@ -0,0 +1,150 @@
+ ---
+ license: cc-by-nc-4.0
+ datasets:
+ - amphion/Emilia-Dataset
+ language:
+ - en
+ - zh
+ - ko
+ - ja
+ - fr
+ - de
+ base_model:
+ - amphion/MaskGCT
+ pipeline_tag: text-to-speech
+ ---
+ ## MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
+
+ [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)
+
+ ## Quickstart
+
+ **Clone and install**
+
+ ```bash
+ git clone https://github.com/open-mmlab/Amphion.git
+ cd Amphion
+ # create the environment (the script path is relative to the repository root)
+ bash ./models/tts/maskgct/env.sh
+ ```
+
+ **Model download**
+
+ We provide the following pretrained checkpoints:
+
+ | Model Name | Description |
+ |-------------------|-------------|
+ | [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec) | Converts speech to semantic tokens. |
+ | [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec) | Converts speech to acoustic tokens and reconstructs the waveform from acoustic tokens. |
+ | [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicts semantic tokens from text and prompt semantic tokens. |
+ | [MaskGCT-S2A](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicts acoustic tokens conditioned on semantic tokens. |
+
+ You can download all pretrained checkpoints from [Hugging Face](https://huggingface.co/amphion/MaskGCT/tree/main) or fetch them with the `huggingface_hub` API.
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # download semantic codec ckpt
+ semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")
+
+ # download acoustic codec ckpt (encoder and decoder are stored as separate files)
+ codec_encoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model.safetensors")
+ codec_decoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors")
+
+ # download t2s model ckpt
+ t2s_model_ckpt = hf_hub_download("amphion/MaskGCT", filename="t2s_model/model.safetensors")
+
+ # download s2a model ckpt
+ s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_1layer/model.safetensors")
+ s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")
+ ```
+
+ **Basic Usage**
+
+ You can use the following code to generate speech from text and a speech prompt.
+ ```python
+ from models.tts.maskgct.maskgct_utils import *
+ from huggingface_hub import hf_hub_download
+ import safetensors.torch  # load_model lives in the torch submodule
+ import soundfile as sf
+ import torch
+
+ if __name__ == "__main__":
+
+     # build model
+     device = torch.device("cuda:0")
+     cfg_path = "./models/tts/maskgct/config/maskgct.json"
+     cfg = load_config(cfg_path)
+     # 1. build semantic model (w2v-bert-2.0)
+     semantic_model, semantic_mean, semantic_std = build_semantic_model(device)
+     # 2. build semantic codec
+     semantic_codec = build_semantic_codec(cfg.model.semantic_codec, device)
+     # 3. build acoustic codec
+     codec_encoder, codec_decoder = build_acoustic_codec(cfg.model.acoustic_codec, device)
+     # 4. build t2s model
+     t2s_model = build_t2s_model(cfg.model.t2s_model, device)
+     # 5. build s2a model
+     s2a_model_1layer = build_s2a_model(cfg.model.s2a_model.s2a_1layer, device)
+     s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)
+
+     # download checkpoint
+     ...
+
+     # load semantic codec
+     safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
+     # load acoustic codec
+     safetensors.torch.load_model(codec_encoder, codec_encoder_ckpt)
+     safetensors.torch.load_model(codec_decoder, codec_decoder_ckpt)
+     # load t2s model
+     safetensors.torch.load_model(t2s_model, t2s_model_ckpt)
+     # load s2a model
+     safetensors.torch.load_model(s2a_model_1layer, s2a_1layer_ckpt)
+     safetensors.torch.load_model(s2a_model_full, s2a_full_ckpt)
+
+     # inference
+     prompt_wav_path = "./models/tts/maskgct/wav/prompt.wav"
+     save_path = "[YOUR SAVE PATH]"
+     prompt_text = " We do not break. We never give in. We never back down."
+     target_text = "In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision."
+     # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
+     target_len = 18
+
+     maskgct_inference_pipeline = MaskGCT_Inference_Pipeline(
+         semantic_model,
+         semantic_codec,
+         codec_encoder,
+         codec_decoder,
+         t2s_model,
+         s2a_model_1layer,
+         s2a_model_full,
+         semantic_mean,
+         semantic_std,
+         device,
+     )
+
+     recovered_audio = maskgct_inference_pipeline.maskgct_inference(
+         prompt_wav_path, prompt_text, target_text, "en", "en", target_len=target_len
+     )
+     # write the generated waveform at a 24 kHz sample rate
+     sf.write(save_path, recovered_audio, 24000)
+ ```
+
+ **Training Dataset**
+
+ We use the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset to train our models. Emilia is a multilingual and diverse in-the-wild speech dataset designed for large-scale speech generation. In this work, we use English and Chinese data from Emilia, each with 50K hours of speech (totaling 100K hours).
+
+ **Citation**
+
+ If you use MaskGCT in your research, please cite the following paper:
+ ```bibtex
+ @article{wang2024maskgct,
+   title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
+   author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
+   journal={arXiv preprint arXiv:2409.00750},
+   year={2024}
+ }
+ @inproceedings{amphion,
+   author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+   title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
+   booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
+   year={2024}
+ }
+ ```
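
Reviewer note: the README's download snippet repeats one `hf_hub_download` call per checkpoint. A possible consolidation is sketched below. The names `CHECKPOINT_FILES`, `download_checkpoints`, and the injectable `fetch` parameter are hypothetical and not part of Amphion. The file paths are the ones listed in the README.

```python
from typing import Callable, Dict, Optional

REPO_ID = "amphion/MaskGCT"

# Checkpoint files named in the README's download snippet.
CHECKPOINT_FILES: Dict[str, str] = {
    "semantic_codec": "semantic_codec/model.safetensors",
    "codec_encoder": "acoustic_codec/model.safetensors",
    "codec_decoder": "acoustic_codec/model_1.safetensors",
    "t2s_model": "t2s_model/model.safetensors",
    "s2a_1layer": "s2a_model/s2a_model_1layer/model.safetensors",
    "s2a_full": "s2a_model/s2a_model_full/model.safetensors",
}


def download_checkpoints(
    fetch: Optional[Callable[[str, str], str]] = None,
) -> Dict[str, str]:
    """Return {name: local_path} for every checkpoint.

    `fetch(repo_id, filename)` defaults to huggingface_hub.hf_hub_download.
    Passing a stand-in lets the mapping be checked without network access.
    """
    if fetch is None:
        # Imported lazily so the mapping can be inspected without the package.
        from huggingface_hub import hf_hub_download

        fetch = hf_hub_download
    return {name: fetch(REPO_ID, filename) for name, filename in CHECKPOINT_FILES.items()}
```

Calling `download_checkpoints()` with no argument would download roughly 6.6 GB in total, going by the LFS pointer sizes recorded in this commit.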
acoustic_codec/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2be0eb4c6a526c666584cb0c6ad9dce96e7b29752a39e113249a7e65e17d97d9
+ size 170172536
acoustic_codec/model_1.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9dbacf8050a4b0d7948eb4224fdc7c61dfe9cb7876d8f32560b93727358256a5
+ size 512832144
config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "download_tracking": {
+     "query_files": ["config.json", "*.safetensors"]
+   }
+ }
s2a_model/s2a_model_1layer/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4d5880e467cb82cac502c6122df2b1242c721c46b8f769161d5f64cf65d9e71c
+ size 1321418200
s2a_model/s2a_model_full/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:27518b0ffae8afdeec8d9b6102868ced38d2a93477eb992d381c188383e78cfa
+ size 1413786096
semantic_codec/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec947271175d8cad75ec37e83aa487e27c97a0f72a303393772da5ffa84bddf2
+ size 177183712
t2s_model/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:543156edd53f533572b751ca2e179c498b51fe96bb8a181e82e31b5ef455230e
+ size 2985622968
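
Reviewer note: each `.safetensors` entry in this commit is a Git LFS pointer recording a SHA-256 `oid` and a byte `size`. The sketch below shows one way a downloaded checkpoint could be checked against such a pointer. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the parser assumes the simple three-line `key value` layout seen above.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file's size and hash against a parsed pointer."""
    file_path = Path(path)
    # Cheap size comparison first, so large files are hashed only when needed.
    if file_path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with file_path.open("rb") as handle:
        # Read in chunks to avoid loading multi-gigabyte files into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

A typical use would be to parse the pointer text for `t2s_model/model.safetensors` and pass the local path returned by the download step to `matches_pointer`.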