Safetensors · tts · vc · svs · svc · music
RMSnow committed · verified · commit bf31efa · parent f0b3567

Update README.md

Files changed (1): README.md (+312, -53)

README.md CHANGED
@@ -45,9 +45,7 @@ We have included the following pre-trained models at Amphion:
  The training data includes:
 
  - **Emilia-101k**: about 101k hours of speech data
-
  - **Sing-0.4k**: about 400 hours of open-source singing voice data as follows:
-
  | Dataset Name | \#Hours |
  | ------------ | --------- |
  | ACESinger | 320.6 |
@@ -58,62 +56,323 @@ The training data includes:
  | Opencpop | 5.1 |
  | CSD | 3.8 |
  | **Total** | **438.9** |
-
  - **SingNet-7k**: about 7,000 hours of internal singing voice data, preprocessed using the [SingNet pipeline](https://openreview.net/pdf?id=X6ffdf6nh3). The SingNet-3k is a 3000-hour subset of SingNet-7k.
 
- ## Quickstart (Inference Only)
-
- To infer with Vevo1.5, you need to follow the steps below:
-
- 1. Clone the repository and install the environment.
- 2. Run the inference script.
-
- > **Note:** Same environment requirement as MaskGCT/Vevo.
- ### Clone and Environment Setup
-
- #### 1. Clone the repository
-
- ```bash
- git clone https://github.com/open-mmlab/Amphion.git
- cd Amphion
  ```
 
- #### 2. Install the environment
-
- Before you start installing, make sure you are in the `Amphion` directory. If not, use `cd` to enter it.
-
- Since we use `phonemizer` to convert text to phonemes, you need to install `espeak-ng` first. More details can be found [here](https://bootphon.github.io/phonemizer/install.html). Choose the correct installation command according to your operating system:
-
- ```bash
- # For Debian-like distributions (e.g. Ubuntu, Mint, etc.)
- sudo apt-get install espeak-ng
- # For RedHat-like distributions (e.g. CentOS, Fedora, etc.)
- sudo yum install espeak-ng
- ```
-
- Now, we are going to install the environment. It is recommended to use conda:
-
- ```bash
- conda create -n vevo python=3.10
- conda activate vevo
-
- pip install -r models/vc/vevo/requirements.txt
- ```
-
- ### Inference Script
-
- ```sh
- # FM model only (i.e., timbre control; usually for VC and SVC)
- python -m models.svc.vevosing.infer_vevosing_fm
-
- # AR + FM (i.e., text, prosody, and style control)
- python -m models.svc.vevosing.infer_vevosing_ar
- ```
-
- Running this will automatically download the pretrained model from HuggingFace and start the inference process. The generated audio files are saved in `models/svc/vevosing/output/*.wav` by default.
-
-
+ ## Usage
+ You can refer to our [recipe](https://github.com/open-mmlab/Amphion/blob/vevosing/models/svc/vevosing/README.md) on GitHub for more usage details. For example, to use Vevo1.5 after cloning the Amphion GitHub repository, you can run a script like the following:
+
+ ```python
+ import os
+
+ import torch
+ from huggingface_hub import snapshot_download
+
+ from models.svc.vevosing.vevosing_utils import *
+
+
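+ # Zero-shot TTS: `ref_wav_path` is the style (text/prosody) reference; unless a
+ # separate `timbre_ref_wav_path` is given, it also serves as the timbre reference.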
+ def vevosing_tts(
+     tgt_text,
+     ref_wav_path,
+     ref_text=None,
+     timbre_ref_wav_path=None,
+     output_path=None,
+     src_language="en",
+     ref_language="en",
+ ):
+     if timbre_ref_wav_path is None:
+         timbre_ref_wav_path = ref_wav_path
+
+     gen_audio = inference_pipeline.inference_ar_and_fm(
+         task="synthesis",
+         src_wav_path=None,
+         src_text=tgt_text,
+         style_ref_wav_path=ref_wav_path,
+         timbre_ref_wav_path=timbre_ref_wav_path,
+         style_ref_wav_text=ref_text,
+         src_text_language=src_language,
+         style_ref_wav_text_language=ref_language,
+     )
+
+     assert output_path is not None
+     save_audio(gen_audio, output_path=output_path)
+
+
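+ # Singing/speech editing: synthesize the edited target text while reusing the
+ # raw wav's own prosody codes and timbre, aiming to keep the rest of the
+ # performance unchanged.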
+ def vevosing_editing(
+     tgt_text,
+     raw_wav_path,
+     raw_text=None,
+     output_path=None,
+     raw_language="en",
+     tgt_language="en",
+ ):
+     gen_audio = inference_pipeline.inference_ar_and_fm(
+         task="recognition-synthesis",
+         src_wav_path=raw_wav_path,
+         src_text=tgt_text,
+         style_ref_wav_path=raw_wav_path,
+         style_ref_wav_text=raw_text,
+         src_text_language=tgt_language,
+         style_ref_wav_text_language=raw_language,
+         timbre_ref_wav_path=raw_wav_path,  # keep the timbre of the raw wav
+         use_style_tokens_as_ar_input=True,  # use the prosody code of the raw wav
+     )
+
+     assert output_path is not None
+     save_audio(gen_audio, output_path=output_path)
+
+
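+ # Singing style conversion: keep the raw wav's lyrics and timbre while borrowing
+ # the singing technique (e.g., breathy or vibrato) from the style reference.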
+ def vevosing_singing_style_conversion(
+     raw_wav_path,
+     style_ref_wav_path,
+     output_path=None,
+     raw_text=None,
+     style_ref_text=None,
+     raw_language="en",
+     style_ref_language="en",
+ ):
+     gen_audio = inference_pipeline.inference_ar_and_fm(
+         task="recognition-synthesis",
+         src_wav_path=raw_wav_path,
+         src_text=raw_text,
+         style_ref_wav_path=style_ref_wav_path,
+         style_ref_wav_text=style_ref_text,
+         src_text_language=raw_language,
+         style_ref_wav_text_language=style_ref_language,
+         timbre_ref_wav_path=raw_wav_path,  # keep the timbre of the raw wav
+         use_style_tokens_as_ar_input=True,  # use the prosody code of the raw wav
+     )
+
+     assert output_path is not None
+     save_audio(gen_audio, output_path=output_path)
+
+
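+ # Melody control: a melody source (e.g., humming or piano) drives the tune, the
+ # target text supplies the lyrics, and separate references control style and timbre.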
+ def vevosing_melody_control(
+     tgt_text,
+     tgt_melody_wav_path,
+     output_path=None,
+     style_ref_wav_path=None,
+     style_ref_text=None,
+     timbre_ref_wav_path=None,
+     tgt_language="en",
+     style_ref_language="en",
+ ):
+     gen_audio = inference_pipeline.inference_ar_and_fm(
+         task="recognition-synthesis",
+         src_wav_path=tgt_melody_wav_path,
+         src_text=tgt_text,
+         style_ref_wav_path=style_ref_wav_path,
+         style_ref_wav_text=style_ref_text,
+         src_text_language=tgt_language,
+         style_ref_wav_text_language=style_ref_language,
+         timbre_ref_wav_path=timbre_ref_wav_path,
+         use_style_tokens_as_ar_input=True,  # use the prosody code of the melody wav
+     )
+
+     assert output_path is not None
+     save_audio(gen_audio, output_path=output_path)
+
+
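+ # Assemble the full pipeline: download each component of amphion/Vevo1.5 from
+ # HuggingFace (prosody and content-style tokenizers, autoregressive transformer,
+ # flow-matching transformer, vocoder) and wire them together.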
+ def load_inference_pipeline():
+     # ===== Device =====
+     device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+
+     # ===== Prosody Tokenizer =====
+     local_dir = snapshot_download(
+         repo_id="amphion/Vevo1.5",
+         repo_type="model",
+         cache_dir="./ckpts/Vevo1.5",
+         allow_patterns=["tokenizer/prosody_fvq512_6.25hz/*"],
+     )
+     prosody_tokenizer_ckpt_path = os.path.join(
+         local_dir, "tokenizer/prosody_fvq512_6.25hz"
+     )
+
+     # ===== Content-Style Tokenizer =====
+     local_dir = snapshot_download(
+         repo_id="amphion/Vevo1.5",
+         repo_type="model",
+         cache_dir="./ckpts/Vevo1.5",
+         allow_patterns=["tokenizer/contentstyle_fvq16384_12.5hz/*"],
+     )
+     contentstyle_tokenizer_ckpt_path = os.path.join(
+         local_dir, "tokenizer/contentstyle_fvq16384_12.5hz"
+     )
+
+     # ===== Autoregressive Transformer =====
+     model_name = "ar_emilia101k_singnet7k"
+
+     local_dir = snapshot_download(
+         repo_id="amphion/Vevo1.5",
+         repo_type="model",
+         cache_dir="./ckpts/Vevo1.5",
+         allow_patterns=[f"contentstyle_modeling/{model_name}/*"],
+     )
+
+     ar_cfg_path = f"./models/svc/vevosing/config/{model_name}.json"
+     ar_ckpt_path = os.path.join(
+         local_dir,
+         f"contentstyle_modeling/{model_name}",
+     )
+
+     # ===== Flow Matching Transformer =====
+     model_name = "fm_emilia101k_singnet7k"
+
+     local_dir = snapshot_download(
+         repo_id="amphion/Vevo1.5",
+         repo_type="model",
+         cache_dir="./ckpts/Vevo1.5",
+         allow_patterns=[f"acoustic_modeling/{model_name}/*"],
+     )
+
+     fmt_cfg_path = f"./models/svc/vevosing/config/{model_name}.json"
+     fmt_ckpt_path = os.path.join(local_dir, f"acoustic_modeling/{model_name}")
+
+     # ===== Vocoder =====
+     local_dir = snapshot_download(
+         repo_id="amphion/Vevo1.5",
+         repo_type="model",
+         cache_dir="./ckpts/Vevo1.5",
+         allow_patterns=["acoustic_modeling/Vocoder/*"],
+     )
+
+     vocoder_cfg_path = "./models/svc/vevosing/config/vocoder.json"
+     vocoder_ckpt_path = os.path.join(local_dir, "acoustic_modeling/Vocoder")
+
+     # ===== Inference =====
+     inference_pipeline = VevosingInferencePipeline(
+         prosody_tokenizer_ckpt_path=prosody_tokenizer_ckpt_path,
+         content_style_tokenizer_ckpt_path=contentstyle_tokenizer_ckpt_path,
+         ar_cfg_path=ar_cfg_path,
+         ar_ckpt_path=ar_ckpt_path,
+         fmt_cfg_path=fmt_cfg_path,
+         fmt_ckpt_path=fmt_ckpt_path,
+         vocoder_cfg_path=vocoder_cfg_path,
+         vocoder_ckpt_path=vocoder_ckpt_path,
+         device=device,
+     )
+     return inference_pipeline
+
+
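+ # Demo usage: the first run downloads the checkpoints into ./ckpts/Vevo1.5;
+ # subsequent runs reuse the cached snapshot.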
+ if __name__ == "__main__":
+     inference_pipeline = load_inference_pipeline()
+
+     output_dir = "./models/svc/vevosing/output"
+     os.makedirs(output_dir, exist_ok=True)
+
+     ### Zero-shot Text-to-Speech and Text-to-Singing ###
+     tgt_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
+     ref_wav_path = "./models/vc/vevo/wav/arabic_male.wav"
+     ref_text = "Flip stood undecided, his ears strained to catch the slightest sound."
+
+     jaychou_path = "./models/svc/vevosing/wav/jaychou.wav"
+     jaychou_text = (
+         "对这个世界如果你有太多的抱怨,跌倒了就不该继续往前走,为什么,人要这么的脆弱堕"
+     )
+     taiyizhenren_path = "./models/svc/vevosing/wav/taiyizhenren.wav"
+     taiyizhenren_text = (
+         "对,这就是我,万人敬仰的太乙真人。虽然有点婴儿肥,但也掩不住我,逼人的帅气。"
+     )
+
+     # the style reference and the timbre reference are the same
+     vevosing_tts(
+         tgt_text=tgt_text,
+         ref_wav_path=ref_wav_path,
+         timbre_ref_wav_path=ref_wav_path,
+         output_path=os.path.join(output_dir, "zstts.wav"),
+         ref_text=ref_text,
+         src_language="en",
+         ref_language="en",
+     )
+
+     # the style reference and the timbre reference are different
+     vevosing_tts(
+         tgt_text=tgt_text,
+         ref_wav_path=ref_wav_path,
+         timbre_ref_wav_path=jaychou_path,
+         output_path=os.path.join(output_dir, "zstts_disentangled.wav"),
+         ref_text=ref_text,
+         src_language="en",
+         ref_language="en",
+     )
+
+     # the style reference is a singing voice
+     vevosing_tts(
+         tgt_text="顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:“无论如何,都要冷静下来,重新开始。”",
+         ref_wav_path=jaychou_path,
+         ref_text=jaychou_text,
+         timbre_ref_wav_path=taiyizhenren_path,
+         output_path=os.path.join(output_dir, "zstts_singing.wav"),
+         src_language="zh",
+         ref_language="zh",
+     )
+
+     ### Zero-shot Singing Editing ###
+     adele_path = "./models/svc/vevosing/wav/adele.wav"
+     adele_text = "Never mind, I'll find someone like you. I wish nothing but."
+
+     vevosing_editing(
+         tgt_text="Never mind, you'll find anyone like me. You wish nothing but.",
+         raw_wav_path=adele_path,
+         raw_text=adele_text,  # "Never mind, I'll find someone like you. I wish nothing but."
+         output_path=os.path.join(output_dir, "editing_adele.wav"),
+         raw_language="en",
+         tgt_language="en",
+     )
+
+     vevosing_editing(
+         tgt_text="对你的人生如果你有太多的期盼,跌倒了就不该低头认输,为什么啊,人要这么的彷徨堕",
+         raw_wav_path=jaychou_path,
+         raw_text=jaychou_text,  # "对这个世界如果你有太多的抱怨,跌倒了就不该继续往前走,为什么,人要这么的脆弱堕"
+         output_path=os.path.join(output_dir, "editing_jaychou.wav"),
+         raw_language="zh",
+         tgt_language="zh",
+     )
+
+     ### Zero-shot Singing Style Conversion ###
+     breathy_path = "./models/svc/vevosing/wav/breathy.wav"
+     breathy_text = "离别没说再见你是否心酸"
+
+     vibrato_path = "./models/svc/vevosing/wav/vibrato.wav"
+     vibrato_text = "玫瑰的红,容易受伤的梦,握在手中却流失于指缝"
+
+     vevosing_singing_style_conversion(
+         raw_wav_path=breathy_path,
+         raw_text=breathy_text,
+         style_ref_wav_path=vibrato_path,
+         style_ref_text=vibrato_text,
+         output_path=os.path.join(output_dir, "ssc_breathy2vibrato.wav"),
+         raw_language="zh",
+         style_ref_language="zh",
+     )
+
+     ### Melody Control for Singing Synthesis ###
+     humming_path = "./models/svc/vevosing/wav/humming.wav"
+     piano_path = "./models/svc/vevosing/wav/piano.wav"
+
+     # Humming to control the melody
+     vevosing_melody_control(
+         tgt_text="你是我的小呀小苹果,怎么爱,不嫌多",
+         tgt_melody_wav_path=humming_path,
+         output_path=os.path.join(output_dir, "melody_humming.wav"),
+         style_ref_wav_path=taiyizhenren_path,
+         style_ref_text=taiyizhenren_text,
+         timbre_ref_wav_path=taiyizhenren_path,
+         tgt_language="zh",
+         style_ref_language="zh",
+     )
+
+     # Piano to control the melody
+     vevosing_melody_control(
+         tgt_text="你是我的小呀小苹果,怎么爱,不嫌多",
+         tgt_melody_wav_path=piano_path,
+         output_path=os.path.join(output_dir, "melody_piano.wav"),
+         style_ref_wav_path=taiyizhenren_path,
+         style_ref_text=taiyizhenren_text,
+         timbre_ref_wav_path=taiyizhenren_path,
+         tgt_language="zh",
+         style_ref_language="zh",
+     )
  ```
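
Running this script will automatically download the pretrained Vevo1.5 checkpoints from HuggingFace (cached under `./ckpts/Vevo1.5`) and save the generated audio files to `./models/svc/vevosing/output/*.wav` by default.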
  ## Citations
 
  If you find this work useful for your research, please cite our paper: