Billpai committed
Commit f196feb · Parent(s): 507c407

test
- egs/svc/MultipleContentsSVC/README.md +153 -0
- egs/svc/MultipleContentsSVC/exp_config.json +126 -0
- egs/svc/MultipleContentsSVC/run.sh +1 -0
- egs/svc/README.md +34 -0
- egs/svc/_template/run.sh +150 -0
- egs/vocoder/README.md +23 -0
- egs/vocoder/diffusion/README.md +0 -0
- egs/vocoder/diffusion/exp_config_base.json +0 -0
- egs/vocoder/gan/README.md +224 -0
- egs/vocoder/gan/_template/run.sh +143 -0
- egs/vocoder/gan/apnet/exp_config.json +45 -0
- egs/vocoder/gan/apnet/run.sh +143 -0
- egs/vocoder/gan/bigvgan/exp_config.json +66 -0
- egs/vocoder/gan/bigvgan/run.sh +143 -0
- egs/vocoder/gan/bigvgan_large/exp_config.json +70 -0
- egs/vocoder/gan/bigvgan_large/run.sh +143 -0
- egs/vocoder/gan/exp_config_base.json +111 -0
- egs/vocoder/gan/hifigan/exp_config.json +59 -0
- egs/vocoder/gan/hifigan/run.sh +143 -0
- egs/vocoder/gan/melgan/exp_config.json +34 -0
- egs/vocoder/gan/melgan/run.sh +143 -0
- egs/vocoder/gan/nsfhifigan/exp_config.json +83 -0
- egs/vocoder/gan/nsfhifigan/run.sh +143 -0
- egs/vocoder/gan/tfr_enhanced_hifigan/README.md +185 -0
- egs/vocoder/gan/tfr_enhanced_hifigan/exp_config.json +118 -0
- egs/vocoder/gan/tfr_enhanced_hifigan/run.sh +145 -0
- examples/chinese_female_recordings.wav +3 -0
- examples/chinese_male_seperated.wav +3 -0
- examples/english_female_seperated.wav +3 -0
- examples/english_male_recordings.wav +3 -0
- examples/output/.DS_Store +0 -0
- examples/output/chinese_female_recordings_vocalist_l1_JohnMayer.wav +3 -0
- examples/output/chinese_male_seperated_vocalist_l1_TaylorSwift.wav +3 -0
- examples/output/english_female_seperated_vocalist_l1_汪峰.wav +3 -0
- examples/output/english_male_recordings_vocalist_l1_石倚洁.wav +3 -0
- models/__init__.py +0 -0
- models/base/__init__.py +7 -0
- models/base/base_dataset.py +350 -0
- models/base/base_inference.py +220 -0
- models/base/base_sampler.py +136 -0
- models/base/base_trainer.py +348 -0
- models/base/new_dataset.py +50 -0
- models/base/new_inference.py +249 -0
- models/base/new_trainer.py +722 -0
- models/svc/__init__.py +0 -0
- models/svc/base/__init__.py +7 -0
- models/svc/base/svc_dataset.py +425 -0
- models/svc/base/svc_inference.py +15 -0
- models/svc/base/svc_trainer.py +111 -0
- models/svc/comosvc/__init__.py +4 -0
egs/svc/MultipleContentsSVC/README.md
ADDED
@@ -0,0 +1,153 @@
# Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion

[](https://arxiv.org/abs/2310.11160)
[](https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html)

<br>
<div align="center">
<img src="../../../imgs/svc/MultipleContentsSVC.png" width="85%">
</div>
<br>

This is the official implementation of the paper "[Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion](https://arxiv.org/abs/2310.11160)" (NeurIPS 2023 Workshop on Machine Learning for Audio). Specifically,

- The multiple content features are from [Whisper](https://github.com/openai/whisper) and [ContentVec](https://github.com/auspicious3000/contentvec).
- The acoustic model is based on a Bidirectional Non-Causal Dilated CNN (called `DiffWaveNetSVC` in Amphion), which is similar to [WaveNet](https://arxiv.org/pdf/1609.03499.pdf), [DiffWave](https://openreview.net/forum?id=a-xFK8Ymz5J), and [DiffSVC](https://ieeexplore.ieee.org/document/9688219).
- The vocoder uses the [BigVGAN](https://github.com/NVIDIA/BigVGAN) architecture, and we fine-tuned it on over 120 hours of singing voice data.

There are four stages in total:

1. Data preparation
2. Features extraction
3. Training
4. Inference/conversion

> **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
> ```bash
> cd Amphion
> ```

## 1. Data Preparation

### Dataset Download

By default, we utilize five datasets for training: M4Singer, Opencpop, OpenSinger, SVCC, and VCTK. How to download them is detailed [here](../../datasets/README.md).

### Configuration

Specify the dataset paths in `exp_config.json`. Note that you can change the `dataset` list to use your preferred datasets.

```json
    "dataset": [
        "m4singer",
        "opencpop",
        "opensinger",
        "svcc",
        "vctk"
    ],
    "dataset_path": {
        // TODO: Fill in your dataset path
        "m4singer": "[M4Singer dataset path]",
        "opencpop": "[Opencpop dataset path]",
        "opensinger": "[OpenSinger dataset path]",
        "svcc": "[SVCC dataset path]",
        "vctk": "[VCTK dataset path]"
    },
```

## 2. Features Extraction

### Content-based Pretrained Models Download

By default, we utilize Whisper and ContentVec to extract content features. How to download them is detailed [here](../../../pretrained/README.md).
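
For intuition, the snippet below is a minimal, illustrative sketch of what "content features from Whisper" means. It assumes the `openai-whisper` package and a local `example.wav`; it is not Amphion's actual extractor (the real one runs behind `bins/svc/preprocess.py`).

```python
# Illustrative only: take frame-level Whisper encoder outputs as content features.
# Assumes the openai-whisper package and an example.wav file on disk.
import torch
import whisper

model = whisper.load_model("medium")            # matches "whisper_model": "medium"
audio = whisper.load_audio("example.wav")       # loaded as 16 kHz mono
audio = whisper.pad_or_trim(audio)              # Whisper expects 30 s windows
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    # Encoder output: (1, 1500, 1024) for the medium model, i.e. 1024-dim
    # frame-level features ("whisper_dim": 1024 in exp_config.json).
    content = model.encoder(mel.unsqueeze(0))

print(content.shape)
```
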
### Configuration

Specify the dataset path and the output path for saving the processed data and the training model in `exp_config.json`:

```json
    // TODO: Fill in the output log path. The default value is "Amphion/ckpts/svc"
    "log_dir": "ckpts/svc",
    "preprocess": {
        // TODO: Fill in the output data path. The default value is "Amphion/data"
        "processed_dir": "data",
        ...
    },
```

### Run

Run `run.sh` as the preprocessing stage (set `--stage 1`).

```bash
sh egs/svc/MultipleContentsSVC/run.sh --stage 1
```

> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set as `"0"` by default. You can change it when running `run.sh` by specifying, for example, `--gpu "1"`.

## 3. Training

### Configuration

We provide the default hyperparameters in `exp_config.json`. They can work on a single NVIDIA 24 GB GPU. You can adjust them based on your GPU machines.

```json
    "train": {
        "batch_size": 32,
        ...
        "adamw": {
            "lr": 2.0e-4
        },
        ...
    }
```

### Run

Run `run.sh` as the training stage (set `--stage 2`). Specify an experiment name to run the following command. The TensorBoard logs and checkpoints will be saved in `Amphion/ckpts/svc/[YourExptName]`.

```bash
sh egs/svc/MultipleContentsSVC/run.sh --stage 2 --name [YourExptName]
```

> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set as `"0"` by default. You can change it when running `run.sh` by specifying, for example, `--gpu "0,1,2,3"`.

## 4. Inference/Conversion

### Pretrained Vocoder Download

We fine-tune the official BigVGAN pretrained model on over 120 hours of singing voice data. The benefits of fine-tuning have been investigated in our paper (see this [demo page](https://www.zhangxueyao.com/data/MultipleContentsSVC/vocoder.html)). The final pretrained singing voice vocoder is released [here](../../../pretrained/README.md#amphion-singing-bigvgan) (called `Amphion Singing BigVGAN`).

### Run

For inference/conversion, you need to specify the following configurations when running `run.sh`:

| Parameters | Description | Example |
| --- | --- | --- |
| `--infer_expt_dir` | The experimental directory which contains `checkpoint`. | `Amphion/ckpts/svc/[YourExptName]` |
| `--infer_output_dir` | The output directory to save inferred audios. | `Amphion/ckpts/svc/[YourExptName]/result` |
| `--infer_source_file` or `--infer_source_audio_dir` | The inference source (can be a json file or a dir). | The `infer_source_file` could be `Amphion/data/[YourDataset]/test.json`, and the `infer_source_audio_dir` is a folder which includes several audio files (*.wav, *.mp3 or *.flac). |
| `--infer_target_speaker` | The target speaker you want to convert into. You can refer to `Amphion/ckpts/svc/[YourExptName]/singers.json` to choose a trained speaker. | For the opencpop dataset, the speaker name would be `opencpop_female1`. |
| `--infer_key_shift` | How many semitones you want to transpose. | `"autoshift"` (by default), `3`, `-3`, etc. |

For example, if you want to make `opencpop_female1` sing the songs in `[Your Audios Folder]`, just run:

```bash
sh egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0" \
	--infer_expt_dir Amphion/ckpts/svc/[YourExptName] \
	--infer_output_dir Amphion/ckpts/svc/[YourExptName]/result \
	--infer_source_audio_dir [Your Audios Folder] \
	--infer_target_speaker "opencpop_female1" \
	--infer_key_shift "autoshift"
```
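
As background for `--infer_key_shift`: transposing by `n` semitones corresponds to scaling F0 by `2^(n/12)`. The snippet below only illustrates that relationship; the `f0` values are made up, and Amphion's own `autoshift` heuristic may differ.

```python
# Illustration of what a key shift of n semitones means for an F0 contour.
# The f0 array is hypothetical; this is not Amphion's transposition code.
import numpy as np

def shift_f0(f0: np.ndarray, semitones: int) -> np.ndarray:
    """Scale an F0 contour (in Hz) by 2**(semitones / 12)."""
    return f0 * (2.0 ** (semitones / 12.0))

f0 = np.array([220.0, 246.9, 261.6])   # a few example pitches in Hz
print(shift_f0(f0, 3))                  # transpose up 3 semitones
print(shift_f0(f0, -3))                 # transpose down 3 semitones
```
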

## Citations

```bibtex
@article{zhang2023leveraging,
  title={Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion},
  author={Zhang, Xueyao and Gu, Yicheng and Chen, Haopeng and Fang, Zihao and Zou, Lexiao and Xue, Liumeng and Wu, Zhizheng},
  journal={Machine Learning for Audio Workshop, NeurIPS 2023},
  year={2023}
}
```
egs/svc/MultipleContentsSVC/exp_config.json
ADDED
@@ -0,0 +1,126 @@
{
    "base_config": "config/diffusion.json",
    "model_type": "DiffWaveNetSVC",
    "dataset": [
        "m4singer",
        "opencpop",
        "opensinger",
        "svcc",
        "vctk"
    ],
    "dataset_path": {
        // TODO: Fill in your dataset path
        "m4singer": "[M4Singer dataset path]",
        "opencpop": "[Opencpop dataset path]",
        "opensinger": "[OpenSinger dataset path]",
        "svcc": "[SVCC dataset path]",
        "vctk": "[VCTK dataset path]"
    },
    // TODO: Fill in the output log path. The default value is "Amphion/ckpts/svc"
    "log_dir": "ckpts/svc",
    "preprocess": {
        // TODO: Fill in the output data path. The default value is "Amphion/data"
        "processed_dir": "data",
        // Config for features extraction
        "extract_mel": true,
        "extract_pitch": true,
        "extract_energy": true,
        "extract_whisper_feature": true,
        "extract_contentvec_feature": true,
        "extract_wenet_feature": false,
        "whisper_batch_size": 30, // decrease it if your GPU is out of memory
        "contentvec_batch_size": 1,
        // Fill in the content-based pretrained model's path
        "contentvec_file": "pretrained/contentvec/checkpoint_best_legacy_500.pt",
        "wenet_model_path": "pretrained/wenet/20220506_u2pp_conformer_exp/final.pt",
        "wenet_config": "pretrained/wenet/20220506_u2pp_conformer_exp/train.yaml",
        "whisper_model": "medium",
        "whisper_model_path": "pretrained/whisper/medium.pt",
        // Config for features usage
        "use_mel": true,
        "use_min_max_norm_mel": true,
        "use_frame_pitch": true,
        "use_frame_energy": true,
        "use_spkid": true,
        "use_whisper": true,
        "use_contentvec": true,
        "use_wenet": false,
        "n_mel": 100,
        "sample_rate": 24000
    },
    "model": {
        "condition_encoder": {
            // Config for features usage
            "use_whisper": true,
            "use_contentvec": true,
            "use_wenet": false,
            "whisper_dim": 1024,
            "contentvec_dim": 256,
            "wenet_dim": 512,
            "use_singer_encoder": false,
            "pitch_min": 50,
            "pitch_max": 1100
        },
        "diffusion": {
            "scheduler": "ddpm",
            "scheduler_settings": {
                "num_train_timesteps": 1000,
                "beta_start": 1.0e-4,
                "beta_end": 0.02,
                "beta_schedule": "linear"
            },
            // Diffusion steps encoder
            "step_encoder": {
                "dim_raw_embedding": 128,
                "dim_hidden_layer": 512,
                "activation": "SiLU",
                "num_layer": 2,
                "max_period": 10000
            },
            // Diffusion decoder
            "model_type": "bidilconv",
            // bidilconv, unet2d, TODO: unet1d
            "bidilconv": {
                "base_channel": 512,
                "n_res_block": 40,
                "conv_kernel_size": 3,
                "dilation_cycle_length": 4,
                // specifically, 1 means no dilation
                "conditioner_size": 384
            }
        }
    },
    "train": {
        "batch_size": 32,
        "gradient_accumulation_step": 1,
        "max_epoch": -1, // -1 means no limit
        "save_checkpoint_stride": [
            3,
            50
        ],
        "keep_last": [
            3,
            2
        ],
        "run_eval": [
            true,
            true
        ],
        "adamw": {
            "lr": 2.0e-4
        },
        "reducelronplateau": {
            "factor": 0.8,
            "patience": 30,
            "min_lr": 1.0e-4
        },
        "dataloader": {
            "num_worker": 8,
            "pin_memory": true
        },
        "sampler": {
            "holistic_shuffle": false,
            "drop_last": true
        }
    }
}
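
To make the `scheduler_settings` above concrete: with a `"linear"` schedule, the 1000 noise levels interpolate `beta` from `1.0e-4` to `0.02`. The sketch below only computes what those numbers imply; it is not the trainer's code (Amphion builds the configured `ddpm` scheduler internally).

```python
# What the "ddpm" scheduler_settings above imply, computed by hand (illustrative).
import numpy as np

num_train_timesteps = 1000
beta_start, beta_end = 1.0e-4, 0.02

betas = np.linspace(beta_start, beta_end, num_train_timesteps)  # "linear" schedule
alphas_cumprod = np.cumprod(1.0 - betas)

# In DDPM, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise,
# so the signal term shrinks toward zero as t approaches 1000.
for t in (0, 499, 999):
    print(f"t={t:4d}  beta={betas[t]:.5f}  sqrt(alpha_bar)={np.sqrt(alphas_cumprod[t]):.4f}")
```
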
egs/svc/MultipleContentsSVC/run.sh
ADDED
@@ -0,0 +1 @@
../_template/run.sh
egs/svc/README.md
ADDED
@@ -0,0 +1,34 @@
# Amphion Singing Voice Conversion (SVC) Recipe

## Quick Start

We provide a **[beginner recipe](MultipleContentsSVC)** to demonstrate how to train a cutting-edge SVC model. Specifically, it is also an official implementation of the paper "[Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion](https://arxiv.org/abs/2310.11160)" (NeurIPS 2023 Workshop on Machine Learning for Audio). Some demos can be seen [here](https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html).

## Supported Model Architectures

The main idea of SVC is to first disentangle the speaker-agnostic representations from the source audio, and then inject the desired speaker information to synthesize the target, which usually utilizes an acoustic decoder and a subsequent waveform synthesizer (vocoder):

<br>
<div align="center">
<img src="../../imgs/svc/pipeline.png" width="70%">
</div>
<br>

Until now, Amphion SVC has supported the following features and models (a high-level sketch of the pipeline follows this list):

- **Speaker-agnostic Representations**:
  - Content Features: Sourcing from [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec).
  - Prosody Features: F0 and energy.
- **Speaker Embeddings**:
  - Speaker Look-Up Table.
  - Reference Encoder (👨‍💻 developing): It can be used for zero-shot SVC.
- **Acoustic Decoders**:
  - Diffusion-based models:
    - **[DiffWaveNetSVC](MultipleContentsSVC)**: The encoder is based on a Bidirectional Non-Causal Dilated CNN, which is similar to [WaveNet](https://arxiv.org/pdf/1609.03499.pdf), [DiffWave](https://openreview.net/forum?id=a-xFK8Ymz5J), and [DiffSVC](https://ieeexplore.ieee.org/document/9688219).
    - **[DiffComoSVC](DiffComoSVC)** (👨‍💻 developing): The diffusion framework is based on the [Consistency Model](https://proceedings.mlr.press/v202/song23a.html). It can significantly accelerate the inference process of the diffusion model.
  - Transformer-based models:
    - **[TransformerSVC](TransformerSVC)**: Encoder-only and Non-autoregressive Transformer Architecture.
  - VAE- and Flow-based models:
    - **[VitsSVC]()** (👨‍💻 developing): It is designed as a [VITS](https://arxiv.org/abs/2106.06103)-like model whose textual input is replaced by the content features, which is similar to [so-vits-svc](https://github.com/svc-develop-team/so-vits-svc).
- **Waveform Synthesizers (Vocoders)**:
  - The supported vocoders can be seen in [Amphion Vocoder Recipe](../vocoder/README.md).
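
The sketch below summarizes how those pieces compose. Every function in it is a placeholder stub for illustration, not Amphion's API.

```python
# High-level, runnable sketch of the SVC pipeline described above.
# All functions are placeholder stubs; they only show how the stages compose.
import numpy as np

def extract_content_features(wav):   # stands in for Whisper / ContentVec / WeNet
    return np.zeros((100, 1024))

def extract_prosody_features(wav):   # stands in for frame-level F0 and energy
    return np.zeros((100, 2))

def acoustic_decoder(content, prosody, speaker):  # e.g. DiffWaveNetSVC
    return np.zeros((100, 100))      # frame-level mel spectrogram

def vocoder(mel):                    # e.g. BigVGAN
    return np.zeros(24000)           # waveform samples

speaker_lookup_table = {"opencpop_female1": np.zeros(256)}

def convert(source_wav, target_speaker):
    content = extract_content_features(source_wav)    # 1. speaker-agnostic content
    prosody = extract_prosody_features(source_wav)    # 1. prosody (F0, energy)
    speaker = speaker_lookup_table[target_speaker]    # 2. desired speaker identity
    mel = acoustic_decoder(content, prosody, speaker) # 3. acoustic decoder
    return vocoder(mel)                               # 4. waveform synthesizer

print(convert(np.zeros(24000), "opencpop_female1").shape)
```
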
egs/svc/_template/run.sh
ADDED
@@ -0,0 +1,150 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

######## Build Experiment Environment ###########
exp_dir=$(cd `dirname $0`; pwd)
work_dir=$(dirname $(dirname $(dirname $exp_dir)))

export WORK_DIR=$work_dir
export PYTHONPATH=$work_dir
export PYTHONIOENCODING=UTF-8

######## Parse the Given Parameters from the Command ###########
options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,resume:,resume_from_ckpt_path:,resume_type:,infer_expt_dir:,infer_output_dir:,infer_source_file:,infer_source_audio_dir:,infer_target_speaker:,infer_key_shift:,infer_vocoder_dir: -- "$@")
eval set -- "$options"

while true; do
  case $1 in
    # Experimental Configuration File
    -c | --config) shift; exp_config=$1 ; shift ;;
    # Experimental Name
    -n | --name) shift; exp_name=$1 ; shift ;;
    # Running Stage
    -s | --stage) shift; running_stage=$1 ; shift ;;
    # Visible GPU machines. The default value is "0".
    --gpu) shift; gpu=$1 ; shift ;;

    # [Only for Training] Resume configuration
    --resume) shift; resume=$1 ; shift ;;
    # [Only for Training] The specific checkpoint path that you want to resume from.
    --resume_from_ckpt_path) shift; resume_from_ckpt_path=$1 ; shift ;;
    # [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
    --resume_type) shift; resume_type=$1 ; shift ;;

    # [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
    --infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
    # [Only for Inference] The output dir to save inferred audios. Its default value is "$infer_expt_dir/result"
    --infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
    # [Only for Inference] The inference source (can be a json file or a dir). For example, the source_file can be "[Your path to save processed data]/[YourDataset]/test.json", and the source_audio_dir can be "$work_dir/source_audio" which includes several audio files (*.wav, *.mp3 or *.flac).
    --infer_source_file) shift; infer_source_file=$1 ; shift ;;
    --infer_source_audio_dir) shift; infer_source_audio_dir=$1 ; shift ;;
    # [Only for Inference] Specify the target speaker you want to convert into. You can refer to "[Your path to save logs and checkpoints]/[Your Expt Name]/singers.json". In this singer look-up table, you can see the usable speaker names (all the keys of the dictionary). For example, for the opencpop dataset, the speaker name would be "opencpop_female1".
    --infer_target_speaker) shift; infer_target_speaker=$1 ; shift ;;
    # [Only for Inference] For advanced users, you can set the trans_key parameter to an integer (the number of semitones you want to transpose). Its default value is "autoshift".
    --infer_key_shift) shift; infer_key_shift=$1 ; shift ;;
    # [Only for Inference] The vocoder dir. Its default value is Amphion/pretrained/bigvgan. See Amphion/pretrained/README.md to download the pretrained BigVGAN vocoders.
    --infer_vocoder_dir) shift; infer_vocoder_dir=$1 ; shift ;;

    --) shift ; break ;;
    *) echo "Invalid option: $1" ; exit 1 ;;
  esac
done


### Value check ###
if [ -z "$running_stage" ]; then
    echo "[Error] Please specify the running stage"
    exit 1
fi

if [ -z "$exp_config" ]; then
    exp_config="${exp_dir}"/exp_config.json
fi
echo "Experimental Configuration File: $exp_config"

if [ -z "$gpu" ]; then
    gpu="0"
fi

######## Features Extraction ###########
if [ $running_stage -eq 1 ]; then
    CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/svc/preprocess.py \
        --config $exp_config \
        --num_workers 4
fi

######## Training ###########
if [ $running_stage -eq 2 ]; then
    if [ -z "$exp_name" ]; then
        echo "[Error] Please specify the experiment name"
        exit 1
    fi
    echo "Experimental Name: $exp_name"

    if [ "$resume" = true ]; then
        echo "Automatically resume from the experimental dir..."
        CUDA_VISIBLE_DEVICES="$gpu" accelerate launch "${work_dir}"/bins/svc/train.py \
            --config "$exp_config" \
            --exp_name "$exp_name" \
            --log_level info \
            --resume
    else
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/svc/train.py \
            --config "$exp_config" \
            --exp_name "$exp_name" \
            --log_level info \
            --resume_from_ckpt_path "$resume_from_ckpt_path" \
            --resume_type "$resume_type"
    fi
fi

######## Inference/Conversion ###########
if [ $running_stage -eq 3 ]; then
    if [ -z "$infer_expt_dir" ]; then
        echo "[Error] Please specify the experimental directory. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
        exit 1
    fi

    if [ -z "$infer_output_dir" ]; then
        infer_output_dir="$infer_expt_dir/result"
    fi

    if [ -z "$infer_source_file" ] && [ -z "$infer_source_audio_dir" ]; then
        echo "[Error] Please specify the source file/dir. The inference source can be a json file or a dir. For example, the source_file can be [Your path to save processed data]/[YourDataset]/test.json, and the source_audio_dir should include several audio files (*.wav, *.mp3 or *.flac)."
        exit 1
    fi

    if [ -z "$infer_source_file" ]; then
        infer_source=$infer_source_audio_dir
    fi

    if [ -z "$infer_source_audio_dir" ]; then
        infer_source=$infer_source_file
    fi

    if [ -z "$infer_target_speaker" ]; then
        echo "[Error] Please specify the target speaker. You can refer to [Your path to save logs and checkpoints]/[Your Expt Name]/singers.json. In this singer look-up table, you can see the usable speaker names (all the keys of the dictionary). For example, for the opencpop dataset, the speaker name would be opencpop_female1."
        exit 1
    fi

    if [ -z "$infer_key_shift" ]; then
        infer_key_shift="autoshift"
    fi

    if [ -z "$infer_vocoder_dir" ]; then
        infer_vocoder_dir="$work_dir"/pretrained/bigvgan
        echo "[Warning] You did not specify infer_vocoder_dir. It is set to $infer_vocoder_dir by default. Make sure that you have followed Amphion/pretrained/README.md to download the pretrained BigVGAN vocoder checkpoint."
    fi

    CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/svc/inference.py \
        --config $exp_config \
        --acoustics_dir $infer_expt_dir \
        --vocoder_dir $infer_vocoder_dir \
        --target_singer $infer_target_speaker \
        --trans_key $infer_key_shift \
        --source $infer_source \
        --output_dir $infer_output_dir \
        --log_level debug
fi
egs/vocoder/README.md
ADDED
@@ -0,0 +1,23 @@
# Amphion Vocoder Recipe

## Quick Start

We provide a [**beginner recipe**](gan/tfr_enhanced_hifigan/README.md) to demonstrate how to train a high-quality HiFi-GAN speech vocoder. Specifically, it is also an official implementation of our paper "[Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder](https://arxiv.org/abs/2311.14957)". Some demos can be seen [here](https://vocodexelysium.github.io/MS-SB-CQTD/).

## Supported Models

A neural vocoder generates audible waveforms from acoustic representations, and is one of the key parts of current audio generation systems. Until now, Amphion has supported various widely used vocoders of different types, including:

- **GAN-based vocoders**, for which we provide [**a unified recipe**](gan/README.md):
  - [MelGAN](https://arxiv.org/abs/1910.06711)
  - [HiFi-GAN](https://arxiv.org/abs/2010.05646)
  - [NSF-HiFiGAN](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts)
  - [BigVGAN](https://arxiv.org/abs/2206.04658)
  - [APNet](https://arxiv.org/abs/2305.07952)
- **Flow-based vocoders** (👨‍💻 developing):
  - [WaveGlow](https://arxiv.org/abs/1811.00002)
- **Diffusion-based vocoders** (👨‍💻 developing):
  - [DiffWave](https://arxiv.org/abs/2009.09761)
- **Auto-regressive based vocoders** (👨‍💻 developing):
  - [WaveNet](https://arxiv.org/abs/1609.03499)
  - [WaveRNN](https://arxiv.org/abs/1802.08435v1)
egs/vocoder/diffusion/README.md
ADDED
File without changes
egs/vocoder/diffusion/exp_config_base.json
ADDED
File without changes
egs/vocoder/gan/README.md
ADDED
@@ -0,0 +1,224 @@
# Amphion GAN-based Vocoder Recipe

## Supported Model Architectures

A GAN-based vocoder consists of a generator and multiple discriminators, as illustrated below:

<br>
<div align="center">
<img src="../../../imgs/vocoder/gan/pipeline.png" width="40%">
</div>
<br>

Until now, Amphion GAN-based Vocoder has supported the following generators and discriminators.

- **Generators**
  - [MelGAN](https://arxiv.org/abs/1910.06711)
  - [HiFi-GAN](https://arxiv.org/abs/2010.05646)
  - [NSF-HiFiGAN](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts)
  - [BigVGAN](https://arxiv.org/abs/2206.04658)
  - [APNet](https://arxiv.org/abs/2305.07952)
- **Discriminators**
  - [Multi-Scale Discriminator](https://arxiv.org/abs/2010.05646)
  - [Multi-Period Discriminator](https://arxiv.org/abs/2010.05646)
  - [Multi-Resolution Discriminator](https://arxiv.org/abs/2011.09631)
  - [Multi-Scale Short-Time Fourier Transform Discriminator](https://arxiv.org/abs/2210.13438)
  - [**Multi-Scale Constant-Q Transform Discriminator (ours)**](https://arxiv.org/abs/2311.14957)

You can use any vocoder architecture with any dataset you want. There are four steps in total:

1. Data preparation
2. Feature extraction
3. Training
4. Inference

> **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
> ```bash
> cd Amphion
> ```

## 1. Data Preparation

You can train the vocoder with any dataset. Amphion's supported open-source datasets are detailed [here](../../../datasets/README.md).

### Configuration

Specify the dataset path in `exp_config_base.json`. Note that you can change the `dataset` list to use your preferred datasets.

```json
    "dataset": [
        "csd",
        "kising",
        "m4singer",
        "nus48e",
        "opencpop",
        "opensinger",
        "opera",
        "pjs",
        "popbutfy",
        "popcs",
        "ljspeech",
        "vctk",
        "libritts",
    ],
    "dataset_path": {
        // TODO: Fill in your dataset path
        "csd": "[dataset path]",
        "kising": "[dataset path]",
        "m4singer": "[dataset path]",
        "nus48e": "[dataset path]",
        "opencpop": "[dataset path]",
        "opensinger": "[dataset path]",
        "opera": "[dataset path]",
        "pjs": "[dataset path]",
        "popbutfy": "[dataset path]",
        "popcs": "[dataset path]",
        "ljspeech": "[dataset path]",
        "vctk": "[dataset path]",
        "libritts": "[dataset path]",
    },
```

## 2. Feature Extraction

The needed features are specified in the individual vocoder directory, so no modification is required here.

### Configuration

Specify the dataset path and the output path for saving the processed data and the training model in `exp_config_base.json`:

```json
    // TODO: Fill in the output log path. The default value is "Amphion/ckpts/vocoder"
    "log_dir": "ckpts/vocoder",
    "preprocess": {
        // TODO: Fill in the output data path. The default value is "Amphion/data"
        "processed_dir": "data",
        ...
    },
```

### Run

Run `run.sh` as the preprocessing stage (set `--stage 1`).

```bash
sh egs/vocoder/gan/{vocoder_name}/run.sh --stage 1
```

> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set as `"0"` by default. You can change it when running `run.sh` by specifying, for example, `--gpu "1"`.

## 3. Training

### Configuration

We provide the default hyperparameters in `exp_config_base.json`. They can work on a single NVIDIA 24 GB GPU. You can adjust them based on your GPU machines.

```json
    "train": {
        "batch_size": 16,
        "max_epoch": 1000000,
        "save_checkpoint_stride": [20],
        "adamw": {
            "lr": 2.0e-4,
            "adam_b1": 0.8,
            "adam_b2": 0.99
        },
        "exponential_lr": {
            "lr_decay": 0.999
        },
    }
```

You can also choose any number of preferred discriminators for training in `exp_config_base.json`.

```json
    "discriminators": [
        "msd",
        "mpd",
        "msstftd",
        "mssbcqtd",
    ],
```

### Run

Run `run.sh` as the training stage (set `--stage 2`). Specify an experiment name to run the following command. The TensorBoard logs and checkpoints will be saved in `Amphion/ckpts/vocoder/[YourExptName]`.

```bash
sh egs/vocoder/gan/{vocoder_name}/run.sh --stage 2 --name [YourExptName]
```

> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set as `"0"` by default. You can change it when running `run.sh` by specifying, for example, `--gpu "0,1,2,3"`.


## 4. Inference

### Run

Run `run.sh` as the inference stage (set `--stage 3`). We provide three different inference modes: `infer_from_dataset`, `infer_from_feature`, and `infer_from_audio`.

```bash
sh egs/vocoder/gan/{vocoder_name}/run.sh --stage 3 \
	--infer_mode [Your chosen inference mode] \
	--infer_datasets [Datasets you want to infer, needed when infer_from_dataset] \
	--infer_feature_dir [Your path to your predicted acoustic features, needed when infer_from_feature] \
	--infer_audio_dir [Your path to your audio files, needed when infer_from_audio] \
	--infer_expt_dir Amphion/ckpts/vocoder/[YourExptName] \
	--infer_output_dir Amphion/ckpts/vocoder/[YourExptName]/result \
```

#### a. Inference from Dataset

Run `run.sh` with the specified datasets; here is an example.

```bash
sh egs/vocoder/gan/{vocoder_name}/run.sh --stage 3 \
	--infer_mode infer_from_dataset \
	--infer_datasets "libritts vctk ljspeech" \
	--infer_expt_dir Amphion/ckpts/vocoder/[YourExptName] \
	--infer_output_dir Amphion/ckpts/vocoder/[YourExptName]/result \
```

#### b. Inference from Features

If you want to run inference from your generated acoustic features, you should first prepare your acoustic features in the following structure:

```plaintext
 ┣ {infer_feature_dir}
 ┃ ┣ mels
 ┃ ┃ ┣ sample1.npy
 ┃ ┃ ┣ sample2.npy
 ┃ ┣ f0s (required if you use NSF-HiFiGAN)
 ┃ ┃ ┣ sample1.npy
 ┃ ┃ ┣ sample2.npy
```
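
As a rough illustration, predicted features could be dumped into that layout like this. The directory name and the random arrays are placeholders; the array shapes must match what your acoustic model and the preprocessing configuration actually produce.

```python
# Illustrative helper for laying out {infer_feature_dir} as shown above.
# Paths and arrays are placeholders; save your model's real mel / F0 outputs.
import os
import numpy as np

infer_feature_dir = "predicted_features"
os.makedirs(os.path.join(infer_feature_dir, "mels"), exist_ok=True)
os.makedirs(os.path.join(infer_feature_dir, "f0s"), exist_ok=True)  # only needed for NSF-HiFiGAN

for name in ("sample1", "sample2"):
    mel = np.random.randn(100, 250).astype(np.float32)   # placeholder mel (n_mel, frames)
    f0 = (np.random.rand(250).astype(np.float32) * 300)  # placeholder frame-level F0 in Hz
    np.save(os.path.join(infer_feature_dir, "mels", f"{name}.npy"), mel)
    np.save(os.path.join(infer_feature_dir, "f0s", f"{name}.npy"), f0)
```
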

Then run `run.sh` with the specified folder; here is an example.

```bash
sh egs/vocoder/gan/{vocoder_name}/run.sh --stage 3 \
	--infer_mode infer_from_feature \
	--infer_feature_dir [Your path to your predicted acoustic features] \
	--infer_expt_dir Amphion/ckpts/vocoder/[YourExptName] \
	--infer_output_dir Amphion/ckpts/vocoder/[YourExptName]/result \
```

#### c. Inference from Audios

If you want to run inference from audios for quick analysis-synthesis, you should first prepare your audios in the following structure:

```plaintext
 ┣ audios
 ┃ ┣ sample1.wav
 ┃ ┣ sample2.wav
```

Then run `run.sh` with the specified folder; here is an example.

```bash
sh egs/vocoder/gan/{vocoder_name}/run.sh --stage 3 \
	--infer_mode infer_from_audio \
	--infer_audio_dir [Your path to your audio files] \
	--infer_expt_dir Amphion/ckpts/vocoder/[YourExptName] \
	--infer_output_dir Amphion/ckpts/vocoder/[YourExptName]/result \
```
egs/vocoder/gan/_template/run.sh
ADDED
@@ -0,0 +1,143 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

######## Build Experiment Environment ###########
exp_dir=$(cd `dirname $0`; pwd)
work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))

export WORK_DIR=$work_dir
export PYTHONPATH=$work_dir
export PYTHONIOENCODING=UTF-8

######## Parse the Given Parameters from the Command ###########
options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,resume:,checkpoint:,resume_type:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
eval set -- "$options"

while true; do
  case $1 in
    # Experimental Configuration File
    -c | --config) shift; exp_config=$1 ; shift ;;
    # Experimental Name
    -n | --name) shift; exp_name=$1 ; shift ;;
    # Running Stage
    -s | --stage) shift; running_stage=$1 ; shift ;;
    # Visible GPU machines. The default value is "0".
    --gpu) shift; gpu=$1 ; shift ;;

    # [Only for Training] Resume configuration
    --resume) shift; resume=$1 ; shift ;;
    # [Only for Training] The specific checkpoint path that you want to resume from.
    --checkpoint) shift; checkpoint=$1 ; shift ;;
    # [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
    --resume_type) shift; resume_type=$1 ; shift ;;

    # [Only for Inference] The inference mode
    --infer_mode) shift; infer_mode=$1 ; shift ;;
    # [Only for Inference] The datasets to run inference on
    --infer_datasets) shift; infer_datasets=$1 ; shift ;;
    # [Only for Inference] The feature dir for inference
    --infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
    # [Only for Inference] The audio dir for inference
    --infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
    # [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
    --infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
    # [Only for Inference] The output dir to save inferred audios. Its default value is "$infer_expt_dir/result"
    --infer_output_dir) shift; infer_output_dir=$1 ; shift ;;

    --) shift ; break ;;
    *) echo "Invalid option: $1" ; exit 1 ;;
  esac
done


### Value check ###
if [ -z "$running_stage" ]; then
    echo "[Error] Please specify the running stage"
    exit 1
fi

if [ -z "$exp_config" ]; then
    exp_config="${exp_dir}"/exp_config.json
fi
echo "Experimental Configuration File: $exp_config"

if [ -z "$gpu" ]; then
    gpu="0"
fi

######## Features Extraction ###########
if [ $running_stage -eq 1 ]; then
    CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
        --config $exp_config \
        --num_workers 8
fi

######## Training ###########
if [ $running_stage -eq 2 ]; then
    if [ -z "$exp_name" ]; then
        echo "[Error] Please specify the experiment name"
        exit 1
    fi
    echo "Experimental Name: $exp_name"

    if [ "$resume" = true ]; then
        echo "Automatically resume from the experimental dir..."
        CUDA_VISIBLE_DEVICES="$gpu" accelerate launch "${work_dir}"/bins/vocoder/train.py \
            --config "$exp_config" \
            --exp_name "$exp_name" \
            --log_level info \
            --resume
    else
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/vocoder/train.py \
            --config "$exp_config" \
            --exp_name "$exp_name" \
            --log_level info \
            --checkpoint "$checkpoint" \
            --resume_type "$resume_type"
    fi
fi

######## Inference/Conversion ###########
if [ $running_stage -eq 3 ]; then
    if [ -z "$infer_expt_dir" ]; then
        echo "[Error] Please specify the experimental directory. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
        exit 1
    fi

    if [ -z "$infer_output_dir" ]; then
        infer_output_dir="$infer_expt_dir/result"
    fi

    if [ $infer_mode = "infer_from_dataset" ]; then
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
            --config $exp_config \
            --infer_mode $infer_mode \
            --infer_datasets $infer_datasets \
            --vocoder_dir $infer_expt_dir \
            --output_dir $infer_output_dir \
            --log_level debug
    fi

    if [ $infer_mode = "infer_from_feature" ]; then
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
            --config $exp_config \
            --infer_mode $infer_mode \
            --feature_folder $infer_feature_dir \
            --vocoder_dir $infer_expt_dir \
            --output_dir $infer_output_dir \
            --log_level debug
    fi

    if [ $infer_mode = "infer_from_audio" ]; then
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
            --config $exp_config \
            --infer_mode $infer_mode \
            --audio_folder $infer_audio_dir \
            --vocoder_dir $infer_expt_dir \
            --output_dir $infer_output_dir \
            --log_level debug
    fi

fi
egs/vocoder/gan/apnet/exp_config.json
ADDED
@@ -0,0 +1,45 @@
{
    "base_config": "egs/vocoder/gan/exp_config_base.json",
    "preprocess": {
        // acoustic features
        "extract_mel": true,
        "extract_audio": true,
        "extract_amplitude_phase": true,

        // Features used for model training
        "use_mel": true,
        "use_audio": true,
        "use_amplitude_phase": true
    },
    "model": {
        "generator": "apnet",
        "apnet": {
            "ASP_channel": 512,
            "ASP_resblock_kernel_sizes": [3,7,11],
            "ASP_resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
            "ASP_input_conv_kernel_size": 7,
            "ASP_output_conv_kernel_size": 7,

            "PSP_channel": 512,
            "PSP_resblock_kernel_sizes": [3,7,11],
            "PSP_resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
            "PSP_input_conv_kernel_size": 7,
            "PSP_output_R_conv_kernel_size": 7,
            "PSP_output_I_conv_kernel_size": 7,
        }
    },
    "train": {
        "criterions": [
            "feature",
            "discriminator",
            "generator",
            "mel",
            "phase",
            "amplitude",
            "consistency"
        ]
    },
    "inference": {
        "batch_size": 1,
    }
}
egs/vocoder/gan/apnet/run.sh
ADDED
@@ -0,0 +1,143 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

######## Build Experiment Environment ###########
exp_dir=$(cd `dirname $0`; pwd)
work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))

export WORK_DIR=$work_dir
export PYTHONPATH=$work_dir
export PYTHONIOENCODING=UTF-8

######## Parse the Given Parameters from the Command ###########
options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,resume:,checkpoint:,resume_type:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
eval set -- "$options"

while true; do
  case $1 in
    # Experimental Configuration File
    -c | --config) shift; exp_config=$1 ; shift ;;
    # Experimental Name
    -n | --name) shift; exp_name=$1 ; shift ;;
    # Running Stage
    -s | --stage) shift; running_stage=$1 ; shift ;;
    # Visible GPU machines. The default value is "0".
    --gpu) shift; gpu=$1 ; shift ;;

    # [Only for Training] Resume configuration
    --resume) shift; resume=$1 ; shift ;;
    # [Only for Training] The specific checkpoint path that you want to resume from.
    --checkpoint) shift; checkpoint=$1 ; shift ;;
    # [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
    --resume_type) shift; resume_type=$1 ; shift ;;

    # [Only for Inference] The inference mode
    --infer_mode) shift; infer_mode=$1 ; shift ;;
    # [Only for Inference] The datasets to run inference on
    --infer_datasets) shift; infer_datasets=$1 ; shift ;;
    # [Only for Inference] The feature dir for inference
    --infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
    # [Only for Inference] The audio dir for inference
    --infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
    # [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
    --infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
    # [Only for Inference] The output dir to save inferred audios. Its default value is "$infer_expt_dir/result"
    --infer_output_dir) shift; infer_output_dir=$1 ; shift ;;

    --) shift ; break ;;
    *) echo "Invalid option: $1" ; exit 1 ;;
  esac
done


### Value check ###
if [ -z "$running_stage" ]; then
    echo "[Error] Please specify the running stage"
    exit 1
fi

if [ -z "$exp_config" ]; then
    exp_config="${exp_dir}"/exp_config.json
fi
echo "Experimental Configuration File: $exp_config"

if [ -z "$gpu" ]; then
    gpu="0"
fi

######## Features Extraction ###########
if [ $running_stage -eq 1 ]; then
    CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
        --config $exp_config \
        --num_workers 8
fi

######## Training ###########
if [ $running_stage -eq 2 ]; then
    if [ -z "$exp_name" ]; then
        echo "[Error] Please specify the experiment name"
        exit 1
    fi
    echo "Experimental Name: $exp_name"

    if [ "$resume" = true ]; then
        echo "Automatically resume from the experimental dir..."
        CUDA_VISIBLE_DEVICES="$gpu" accelerate launch "${work_dir}"/bins/vocoder/train.py \
            --config "$exp_config" \
            --exp_name "$exp_name" \
            --log_level info \
            --resume
    else
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/vocoder/train.py \
            --config "$exp_config" \
            --exp_name "$exp_name" \
            --log_level info \
            --checkpoint "$checkpoint" \
            --resume_type "$resume_type"
    fi
fi

######## Inference/Conversion ###########
if [ $running_stage -eq 3 ]; then
    if [ -z "$infer_expt_dir" ]; then
        echo "[Error] Please specify the experimental directory. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
        exit 1
    fi

    if [ -z "$infer_output_dir" ]; then
        infer_output_dir="$infer_expt_dir/result"
    fi

    if [ $infer_mode = "infer_from_dataset" ]; then
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
            --config $exp_config \
            --infer_mode $infer_mode \
            --infer_datasets $infer_datasets \
            --vocoder_dir $infer_expt_dir \
            --output_dir $infer_output_dir \
            --log_level debug
    fi

    if [ $infer_mode = "infer_from_feature" ]; then
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
            --config $exp_config \
            --infer_mode $infer_mode \
            --feature_folder $infer_feature_dir \
            --vocoder_dir $infer_expt_dir \
            --output_dir $infer_output_dir \
            --log_level debug
    fi

    if [ $infer_mode = "infer_from_audio" ]; then
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
            --config $exp_config \
            --infer_mode $infer_mode \
            --audio_folder $infer_audio_dir \
            --vocoder_dir $infer_expt_dir \
            --output_dir $infer_output_dir \
            --log_level debug
    fi

fi
egs/vocoder/gan/bigvgan/exp_config.json
ADDED
@@ -0,0 +1,66 @@
| 1 |
+
{
|
| 2 |
+
"base_config": "egs/vocoder/gan/exp_config_base.json",
|
| 3 |
+
"preprocess": {
|
| 4 |
+
// acoustic features
|
| 5 |
+
"extract_mel": true,
|
| 6 |
+
"extract_audio": true,
|
| 7 |
+
|
| 8 |
+
// Features used for model training
|
| 9 |
+
"use_mel": true,
|
| 10 |
+
"use_audio": true
|
| 11 |
+
},
|
| 12 |
+
"model": {
|
| 13 |
+
"generator": "bigvgan",
|
| 14 |
+
"bigvgan": {
|
| 15 |
+
"resblock": "1",
|
| 16 |
+
"activation": "snakebeta",
|
| 17 |
+
"snake_logscale": true,
|
| 18 |
+
"upsample_rates": [
|
| 19 |
+
8,
|
| 20 |
+
8,
|
| 21 |
+
2,
|
| 22 |
+
2,
|
| 23 |
+
],
|
| 24 |
+
"upsample_kernel_sizes": [
|
| 25 |
+
16,
|
| 26 |
+
16,
|
| 27 |
+
4,
|
| 28 |
+
4
|
| 29 |
+
],
|
| 30 |
+
"upsample_initial_channel": 512,
|
| 31 |
+
"resblock_kernel_sizes": [
|
| 32 |
+
3,
|
| 33 |
+
7,
|
| 34 |
+
11
|
| 35 |
+
],
|
| 36 |
+
"resblock_dilation_sizes": [
|
| 37 |
+
[
|
| 38 |
+
1,
|
| 39 |
+
3,
|
| 40 |
+
5
|
| 41 |
+
],
|
| 42 |
+
[
|
| 43 |
+
1,
|
| 44 |
+
3,
|
| 45 |
+
5
|
| 46 |
+
],
|
| 47 |
+
[
|
| 48 |
+
1,
|
| 49 |
+
3,
|
| 50 |
+
5
|
| 51 |
+
]
|
| 52 |
+
]
|
| 53 |
+
}
|
| 54 |
+
},
|
| 55 |
+
"train": {
|
| 56 |
+
"criterions": [
|
| 57 |
+
"feature",
|
| 58 |
+
"discriminator",
|
| 59 |
+
"generator",
|
| 60 |
+
"mel",
|
| 61 |
+
]
|
| 62 |
+
},
|
| 63 |
+
"inference": {
|
| 64 |
+
"batch_size": 1,
|
| 65 |
+
}
|
| 66 |
+
}
|
egs/vocoder/gan/bigvgan/run.sh
ADDED
|
@@ -0,0 +1,143 @@
|
| 1 |
+
# Copyright (c) 2023 Amphion.
|
| 2 |
+
#
|
| 3 |
+
# This source code is licensed under the MIT license found in the
|
| 4 |
+
# LICENSE file in the root directory of this source tree.
|
| 5 |
+
|
| 6 |
+
######## Build Experiment Environment ###########
|
| 7 |
+
exp_dir=$(cd `dirname $0`; pwd)
|
| 8 |
+
work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))
|
| 9 |
+
|
| 10 |
+
export WORK_DIR=$work_dir
|
| 11 |
+
export PYTHONPATH=$work_dir
|
| 12 |
+
export PYTHONIOENCODING=UTF-8
|
| 13 |
+
|
| 14 |
+
######## Parse the Given Parameters from the Command ###########
|
| 15 |
+
options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,resume:,checkpoint:,resume_type:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
|
| 16 |
+
eval set -- "$options"
|
| 17 |
+
|
| 18 |
+
while true; do
|
| 19 |
+
case $1 in
|
| 20 |
+
# Experimental Configuration File
|
| 21 |
+
-c | --config) shift; exp_config=$1 ; shift ;;
|
| 22 |
+
# Experimental Name
|
| 23 |
+
-n | --name) shift; exp_name=$1 ; shift ;;
|
| 24 |
+
# Running Stage
|
| 25 |
+
-s | --stage) shift; running_stage=$1 ; shift ;;
|
| 26 |
+
# Visible GPU machines. The default value is "0".
|
| 27 |
+
--gpu) shift; gpu=$1 ; shift ;;
|
| 28 |
+
|
| 29 |
+
# [Only for Training] Resume configuration
|
| 30 |
+
--resume) shift; resume=$1 ; shift ;;
|
| 31 |
+
# [Only for Training] The specific checkpoint path that you want to resume from.
|
| 32 |
+
--checkpoint) shift; checkpoint=$1 ; shift ;;
|
| 33 |
+
# [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
|
| 34 |
+
--resume_type) shift; resume_type=$1 ; shift ;;
|
| 35 |
+
|
| 36 |
+
# [Only for Inference] The inference mode
|
| 37 |
+
--infer_mode) shift; infer_mode=$1 ; shift ;;
|
| 38 |
+
# [Only for Inference] The inferenced datasets
|
| 39 |
+
--infer_datasets) shift; infer_datasets=$1 ; shift ;;
|
| 40 |
+
# [Only for Inference] The feature dir for inference
|
| 41 |
+
--infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
|
| 42 |
+
# [Only for Inference] The audio dir for inference
|
| 43 |
+
--infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
|
| 44 |
+
# [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
|
| 45 |
+
--infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
|
| 46 |
+
# [Only for Inference] The output dir to save inferred audios. Its default value is "$expt_dir/result"
|
| 47 |
+
--infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
|
| 48 |
+
|
| 49 |
+
--) shift ; break ;;
|
| 50 |
+
*) echo "Invalid option: $1" exit 1 ;;
|
| 51 |
+
esac
|
| 52 |
+
done
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
### Value check ###
|
| 56 |
+
if [ -z "$running_stage" ]; then
|
| 57 |
+
echo "[Error] Please specify the running stage"
|
| 58 |
+
exit 1
|
| 59 |
+
fi
|
| 60 |
+
|
| 61 |
+
if [ -z "$exp_config" ]; then
|
| 62 |
+
exp_config="${exp_dir}"/exp_config.json
|
| 63 |
+
fi
|
| 64 |
+
echo "Exprimental Configuration File: $exp_config"
|
| 65 |
+
|
| 66 |
+
if [ -z "$gpu" ]; then
|
| 67 |
+
gpu="0"
|
| 68 |
+
fi
|
| 69 |
+
|
| 70 |
+
######## Features Extraction ###########
|
| 71 |
+
if [ $running_stage -eq 1 ]; then
|
| 72 |
+
CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
|
| 73 |
+
--config $exp_config \
|
| 74 |
+
--num_workers 8
|
| 75 |
+
fi
|
| 76 |
+
|
| 77 |
+
######## Training ###########
|
| 78 |
+
if [ $running_stage -eq 2 ]; then
|
| 79 |
+
if [ -z "$exp_name" ]; then
|
| 80 |
+
echo "[Error] Please specify the experiments name"
|
| 81 |
+
exit 1
|
| 82 |
+
fi
|
| 83 |
+
echo "Exprimental Name: $exp_name"
|
| 84 |
+
|
| 85 |
+
if [ "$resume" = true ]; then
|
| 86 |
+
echo "Automatically resume from the experimental dir..."
|
| 87 |
+
CUDA_VISIBLE_DEVICES="$gpu" accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 88 |
+
--config "$exp_config" \
|
| 89 |
+
--exp_name "$exp_name" \
|
| 90 |
+
--log_level info \
|
| 91 |
+
--resume
|
| 92 |
+
else
|
| 93 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 94 |
+
--config "$exp_config" \
|
| 95 |
+
--exp_name "$exp_name" \
|
| 96 |
+
--log_level info \
|
| 97 |
+
--checkpoint "$checkpoint" \
|
| 98 |
+
--resume_type "$resume_type"
|
| 99 |
+
fi
|
| 100 |
+
fi
|
| 101 |
+
|
| 102 |
+
######## Inference/Conversion ###########
|
| 103 |
+
if [ $running_stage -eq 3 ]; then
|
| 104 |
+
if [ -z "$infer_expt_dir" ]; then
|
| 105 |
+
echo "[Error] Please specify the experimental directionary. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
|
| 106 |
+
exit 1
|
| 107 |
+
fi
|
| 108 |
+
|
| 109 |
+
if [ -z "$infer_output_dir" ]; then
|
| 110 |
+
infer_output_dir="$infer_expt_dir/result"
|
| 111 |
+
fi
|
| 112 |
+
|
| 113 |
+
if [ $infer_mode = "infer_from_dataset" ]; then
|
| 114 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 115 |
+
--config $exp_config \
|
| 116 |
+
--infer_mode $infer_mode \
|
| 117 |
+
--infer_datasets $infer_datasets \
|
| 118 |
+
--vocoder_dir $infer_expt_dir \
|
| 119 |
+
--output_dir $infer_output_dir \
|
| 120 |
+
--log_level debug
|
| 121 |
+
fi
|
| 122 |
+
|
| 123 |
+
if [ $infer_mode = "infer_from_feature" ]; then
|
| 124 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 125 |
+
--config $exp_config \
|
| 126 |
+
--infer_mode $infer_mode \
|
| 127 |
+
--feature_folder $infer_feature_dir \
|
| 128 |
+
--vocoder_dir $infer_expt_dir \
|
| 129 |
+
--output_dir $infer_output_dir \
|
| 130 |
+
--log_level debug
|
| 131 |
+
fi
|
| 132 |
+
|
| 133 |
+
if [ $infer_mode = "infer_from_audio" ]; then
|
| 134 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 135 |
+
--config $exp_config \
|
| 136 |
+
--infer_mode $infer_mode \
|
| 137 |
+
--audio_folder $infer_audio_dir \
|
| 138 |
+
--vocoder_dir $infer_expt_dir \
|
| 139 |
+
--output_dir $infer_output_dir \
|
| 140 |
+
--log_level debug
|
| 141 |
+
fi
|
| 142 |
+
|
| 143 |
+
fi
|
egs/vocoder/gan/bigvgan_large/exp_config.json
ADDED
|
@@ -0,0 +1,70 @@
|
| 1 |
+
{
|
| 2 |
+
"base_config": "egs/vocoder/gan/exp_config_base.json",
|
| 3 |
+
"preprocess": {
|
| 4 |
+
// acoustic features
|
| 5 |
+
"extract_mel": true,
|
| 6 |
+
"extract_audio": true,
|
| 7 |
+
|
| 8 |
+
// Features used for model training
|
| 9 |
+
"use_mel": true,
|
| 10 |
+
"use_audio": true
|
| 11 |
+
},
|
| 12 |
+
"model": {
|
| 13 |
+
"generator": "bigvgan",
|
| 14 |
+
"bigvgan": {
|
| 15 |
+
"resblock": "1",
|
| 16 |
+
"activation": "snakebeta",
|
| 17 |
+
"snake_logscale": true,
|
| 18 |
+
"upsample_rates": [
|
| 19 |
+
4,
|
| 20 |
+
4,
|
| 21 |
+
2,
|
| 22 |
+
2,
|
| 23 |
+
2,
|
| 24 |
+
2
|
| 25 |
+
],
|
| 26 |
+
"upsample_kernel_sizes": [
|
| 27 |
+
8,
|
| 28 |
+
8,
|
| 29 |
+
4,
|
| 30 |
+
4,
|
| 31 |
+
4,
|
| 32 |
+
4
|
| 33 |
+
],
|
| 34 |
+
"upsample_initial_channel": 1536,
|
| 35 |
+
"resblock_kernel_sizes": [
|
| 36 |
+
3,
|
| 37 |
+
7,
|
| 38 |
+
11
|
| 39 |
+
],
|
| 40 |
+
"resblock_dilation_sizes": [
|
| 41 |
+
[
|
| 42 |
+
1,
|
| 43 |
+
3,
|
| 44 |
+
5
|
| 45 |
+
],
|
| 46 |
+
[
|
| 47 |
+
1,
|
| 48 |
+
3,
|
| 49 |
+
5
|
| 50 |
+
],
|
| 51 |
+
[
|
| 52 |
+
1,
|
| 53 |
+
3,
|
| 54 |
+
5
|
| 55 |
+
]
|
| 56 |
+
]
|
| 57 |
+
},
|
| 58 |
+
},
|
| 59 |
+
"train": {
|
| 60 |
+
"criterions": [
|
| 61 |
+
"feature",
|
| 62 |
+
"discriminator",
|
| 63 |
+
"generator",
|
| 64 |
+
"mel",
|
| 65 |
+
]
|
| 66 |
+
},
|
| 67 |
+
"inference": {
|
| 68 |
+
"batch_size": 1,
|
| 69 |
+
}
|
| 70 |
+
}
|
egs/vocoder/gan/bigvgan_large/run.sh
ADDED
|
@@ -0,0 +1,143 @@
|
| 1 |
+
# Copyright (c) 2023 Amphion.
|
| 2 |
+
#
|
| 3 |
+
# This source code is licensed under the MIT license found in the
|
| 4 |
+
# LICENSE file in the root directory of this source tree.
|
| 5 |
+
|
| 6 |
+
######## Build Experiment Environment ###########
|
| 7 |
+
exp_dir=$(cd `dirname $0`; pwd)
|
| 8 |
+
work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))
|
| 9 |
+
|
| 10 |
+
export WORK_DIR=$work_dir
|
| 11 |
+
export PYTHONPATH=$work_dir
|
| 12 |
+
export PYTHONIOENCODING=UTF-8
|
| 13 |
+
|
| 14 |
+
######## Parse the Given Parameters from the Command ###########
|
| 15 |
+
options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,resume:,checkpoint:,resume_type:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
|
| 16 |
+
eval set -- "$options"
|
| 17 |
+
|
| 18 |
+
while true; do
|
| 19 |
+
case $1 in
|
| 20 |
+
# Experimental Configuration File
|
| 21 |
+
-c | --config) shift; exp_config=$1 ; shift ;;
|
| 22 |
+
# Experimental Name
|
| 23 |
+
-n | --name) shift; exp_name=$1 ; shift ;;
|
| 24 |
+
# Running Stage
|
| 25 |
+
-s | --stage) shift; running_stage=$1 ; shift ;;
|
| 26 |
+
# Visible GPU machines. The default value is "0".
|
| 27 |
+
--gpu) shift; gpu=$1 ; shift ;;
|
| 28 |
+
|
| 29 |
+
# [Only for Training] Resume configuration
|
| 30 |
+
--resume) shift; resume=$1 ; shift ;;
|
| 31 |
+
# [Only for Training] The specific checkpoint path that you want to resume from.
|
| 32 |
+
--checkpoint) shift; checkpoint=$1 ; shift ;;
|
| 33 |
+
# [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
|
| 34 |
+
--resume_type) shift; resume_type=$1 ; shift ;;
|
| 35 |
+
|
| 36 |
+
# [Only for Inference] The inference mode
|
| 37 |
+
--infer_mode) shift; infer_mode=$1 ; shift ;;
|
| 38 |
+
# [Only for Inference] The inferenced datasets
|
| 39 |
+
--infer_datasets) shift; infer_datasets=$1 ; shift ;;
|
| 40 |
+
# [Only for Inference] The feature dir for inference
|
| 41 |
+
--infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
|
| 42 |
+
# [Only for Inference] The audio dir for inference
|
| 43 |
+
--infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
|
| 44 |
+
# [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
|
| 45 |
+
--infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
|
| 46 |
+
# [Only for Inference] The output dir to save inferred audios. Its default value is "$expt_dir/result"
|
| 47 |
+
--infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
|
| 48 |
+
|
| 49 |
+
--) shift ; break ;;
|
| 50 |
+
*) echo "Invalid option: $1" exit 1 ;;
|
| 51 |
+
esac
|
| 52 |
+
done
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
### Value check ###
|
| 56 |
+
if [ -z "$running_stage" ]; then
|
| 57 |
+
echo "[Error] Please specify the running stage"
|
| 58 |
+
exit 1
|
| 59 |
+
fi
|
| 60 |
+
|
| 61 |
+
if [ -z "$exp_config" ]; then
|
| 62 |
+
exp_config="${exp_dir}"/exp_config.json
|
| 63 |
+
fi
|
| 64 |
+
echo "Exprimental Configuration File: $exp_config"
|
| 65 |
+
|
| 66 |
+
if [ -z "$gpu" ]; then
|
| 67 |
+
gpu="0"
|
| 68 |
+
fi
|
| 69 |
+
|
| 70 |
+
######## Features Extraction ###########
|
| 71 |
+
if [ $running_stage -eq 1 ]; then
|
| 72 |
+
CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
|
| 73 |
+
--config $exp_config \
|
| 74 |
+
--num_workers 8
|
| 75 |
+
fi
|
| 76 |
+
|
| 77 |
+
######## Training ###########
|
| 78 |
+
if [ $running_stage -eq 2 ]; then
|
| 79 |
+
if [ -z "$exp_name" ]; then
|
| 80 |
+
echo "[Error] Please specify the experiments name"
|
| 81 |
+
exit 1
|
| 82 |
+
fi
|
| 83 |
+
echo "Exprimental Name: $exp_name"
|
| 84 |
+
|
| 85 |
+
if [ "$resume" = true ]; then
|
| 86 |
+
echo "Automatically resume from the experimental dir..."
|
| 87 |
+
CUDA_VISIBLE_DEVICES="$gpu" accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 88 |
+
--config "$exp_config" \
|
| 89 |
+
--exp_name "$exp_name" \
|
| 90 |
+
--log_level info \
|
| 91 |
+
--resume
|
| 92 |
+
else
|
| 93 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 94 |
+
--config "$exp_config" \
|
| 95 |
+
--exp_name "$exp_name" \
|
| 96 |
+
--log_level info \
|
| 97 |
+
--checkpoint "$checkpoint" \
|
| 98 |
+
--resume_type "$resume_type"
|
| 99 |
+
fi
|
| 100 |
+
fi
|
| 101 |
+
|
| 102 |
+
######## Inference/Conversion ###########
|
| 103 |
+
if [ $running_stage -eq 3 ]; then
|
| 104 |
+
if [ -z "$infer_expt_dir" ]; then
|
| 105 |
+
echo "[Error] Please specify the experimental directionary. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
|
| 106 |
+
exit 1
|
| 107 |
+
fi
|
| 108 |
+
|
| 109 |
+
if [ -z "$infer_output_dir" ]; then
|
| 110 |
+
infer_output_dir="$infer_expt_dir/result"
|
| 111 |
+
fi
|
| 112 |
+
|
| 113 |
+
if [ $infer_mode = "infer_from_dataset" ]; then
|
| 114 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 115 |
+
--config $exp_config \
|
| 116 |
+
--infer_mode $infer_mode \
|
| 117 |
+
--infer_datasets $infer_datasets \
|
| 118 |
+
--vocoder_dir $infer_expt_dir \
|
| 119 |
+
--output_dir $infer_output_dir \
|
| 120 |
+
--log_level debug
|
| 121 |
+
fi
|
| 122 |
+
|
| 123 |
+
if [ $infer_mode = "infer_from_feature" ]; then
|
| 124 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 125 |
+
--config $exp_config \
|
| 126 |
+
--infer_mode $infer_mode \
|
| 127 |
+
--feature_folder $infer_feature_dir \
|
| 128 |
+
--vocoder_dir $infer_expt_dir \
|
| 129 |
+
--output_dir $infer_output_dir \
|
| 130 |
+
--log_level debug
|
| 131 |
+
fi
|
| 132 |
+
|
| 133 |
+
if [ $infer_mode = "infer_from_audio" ]; then
|
| 134 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 135 |
+
--config $exp_config \
|
| 136 |
+
--infer_mode $infer_mode \
|
| 137 |
+
--audio_folder $infer_audio_dir \
|
| 138 |
+
--vocoder_dir $infer_expt_dir \
|
| 139 |
+
--output_dir $infer_output_dir \
|
| 140 |
+
--log_level debug
|
| 141 |
+
fi
|
| 142 |
+
|
| 143 |
+
fi
|
egs/vocoder/gan/exp_config_base.json
ADDED
|
@@ -0,0 +1,111 @@
|
| 1 |
+
{
|
| 2 |
+
"base_config": "config/vocoder.json",
|
| 3 |
+
"model_type": "GANVocoder",
|
| 4 |
+
// TODO: Choose your needed datasets
|
| 5 |
+
"dataset": [
|
| 6 |
+
"csd",
|
| 7 |
+
"kising",
|
| 8 |
+
"m4singer",
|
| 9 |
+
"nus48e",
|
| 10 |
+
"opencpop",
|
| 11 |
+
"opensinger",
|
| 12 |
+
"opera",
|
| 13 |
+
"pjs",
|
| 14 |
+
"popbutfy",
|
| 15 |
+
"popcs",
|
| 16 |
+
"ljspeech",
|
| 17 |
+
"vctk",
|
| 18 |
+
"libritts",
|
| 19 |
+
],
|
| 20 |
+
"dataset_path": {
|
| 21 |
+
// TODO: Fill in your dataset path
|
| 22 |
+
"csd": "[dataset path]",
|
| 23 |
+
"kising": "[dataset path]",
|
| 24 |
+
"m4singer": "[dataset path]",
|
| 25 |
+
"nus48e": "[dataset path]",
|
| 26 |
+
"opencpop": "[dataset path]",
|
| 27 |
+
"opensinger": "[dataset path]",
|
| 28 |
+
"opera": "[dataset path]",
|
| 29 |
+
"pjs": "[dataset path]",
|
| 30 |
+
"popbutfy": "[dataset path]",
|
| 31 |
+
"popcs": "[dataset path]",
|
| 32 |
+
"ljspeech": "[dataset path]",
|
| 33 |
+
"vctk": "[dataset path]",
|
| 34 |
+
"libritts": "[dataset path]",
|
| 35 |
+
},
|
| 36 |
+
// TODO: Fill in the output log path
|
| 37 |
+
"log_dir": "ckpts/vocoder",
|
| 38 |
+
"preprocess": {
|
| 39 |
+
// Acoustic features
|
| 40 |
+
"extract_mel": true,
|
| 41 |
+
"extract_audio": true,
|
| 42 |
+
"extract_pitch": false,
|
| 43 |
+
"extract_uv": false,
|
| 44 |
+
"pitch_extractor": "parselmouth",
|
| 45 |
+
|
| 46 |
+
// Features used for model training
|
| 47 |
+
"use_mel": true,
|
| 48 |
+
"use_frame_pitch": false,
|
| 49 |
+
"use_uv": false,
|
| 50 |
+
"use_audio": true,
|
| 51 |
+
|
| 52 |
+
// TODO: Fill in the output data path
|
| 53 |
+
"processed_dir": "data/",
|
| 54 |
+
"n_mel": 100,
|
| 55 |
+
"sample_rate": 24000
|
| 56 |
+
},
|
| 57 |
+
"model": {
|
| 58 |
+
// TODO: Choose your needed discriminators
|
| 59 |
+
"discriminators": [
|
| 60 |
+
"msd",
|
| 61 |
+
"mpd",
|
| 62 |
+
"msstftd",
|
| 63 |
+
"mssbcqtd",
|
| 64 |
+
],
|
| 65 |
+
"mpd": {
|
| 66 |
+
"mpd_reshapes": [
|
| 67 |
+
2,
|
| 68 |
+
3,
|
| 69 |
+
5,
|
| 70 |
+
7,
|
| 71 |
+
11
|
| 72 |
+
],
|
| 73 |
+
"use_spectral_norm": false,
|
| 74 |
+
"discriminator_channel_mult_factor": 1
|
| 75 |
+
},
|
| 76 |
+
"mrd": {
|
| 77 |
+
"resolutions": [[1024, 120, 600], [2048, 240, 1200], [512, 50, 240]],
|
| 78 |
+
"use_spectral_norm": false,
|
| 79 |
+
"discriminator_channel_mult_factor": 1,
|
| 80 |
+
"mrd_override": false
|
| 81 |
+
},
|
| 82 |
+
"msstftd": {
|
| 83 |
+
"filters": 32
|
| 84 |
+
},
|
| 85 |
+
"mssbcqtd": {
|
| 86 |
+
hop_lengths: [512, 256, 256],
|
| 87 |
+
filters: 32,
|
| 88 |
+
max_filters: 1024,
|
| 89 |
+
filters_scale: 1,
|
| 90 |
+
dilations: [1, 2, 4],
|
| 91 |
+
in_channels: 1,
|
| 92 |
+
out_channels: 1,
|
| 93 |
+
n_octaves: [9, 9, 9],
|
| 94 |
+
bins_per_octaves: [24, 36, 48]
|
| 95 |
+
},
|
| 96 |
+
},
|
| 97 |
+
"train": {
|
| 98 |
+
// TODO: Choose a suitable batch size, training epoch, and save stride
|
| 99 |
+
"batch_size": 32,
|
| 100 |
+
"max_epoch": 1000000,
|
| 101 |
+
"save_checkpoint_stride": [20],
|
| 102 |
+
"adamw": {
|
| 103 |
+
"lr": 2.0e-4,
|
| 104 |
+
"adam_b1": 0.8,
|
| 105 |
+
"adam_b2": 0.99
|
| 106 |
+
},
|
| 107 |
+
"exponential_lr": {
|
| 108 |
+
"lr_decay": 0.999
|
| 109 |
+
},
|
| 110 |
+
}
|
| 111 |
+
}
|
egs/vocoder/gan/hifigan/exp_config.json
ADDED
|
@@ -0,0 +1,59 @@
|
| 1 |
+
{
|
| 2 |
+
"base_config": "egs/vocoder/gan/exp_config_base.json",
|
| 3 |
+
"preprocess": {
|
| 4 |
+
// acoustic features
|
| 5 |
+
"extract_mel": true,
|
| 6 |
+
"extract_audio": true,
|
| 7 |
+
|
| 8 |
+
// Features used for model training
|
| 9 |
+
"use_mel": true,
|
| 10 |
+
"use_audio": true
|
| 11 |
+
},
|
| 12 |
+
"model": {
|
| 13 |
+
"generator": "hifigan",
|
| 14 |
+
"hifigan": {
|
| 15 |
+
"resblock": "2",
|
| 16 |
+
"upsample_rates": [
|
| 17 |
+
8,
|
| 18 |
+
8,
|
| 19 |
+
4
|
| 20 |
+
],
|
| 21 |
+
"upsample_kernel_sizes": [
|
| 22 |
+
16,
|
| 23 |
+
16,
|
| 24 |
+
8
|
| 25 |
+
],
|
| 26 |
+
"upsample_initial_channel": 256,
|
| 27 |
+
"resblock_kernel_sizes": [
|
| 28 |
+
3,
|
| 29 |
+
5,
|
| 30 |
+
7
|
| 31 |
+
],
|
| 32 |
+
"resblock_dilation_sizes": [
|
| 33 |
+
[
|
| 34 |
+
1,
|
| 35 |
+
2
|
| 36 |
+
],
|
| 37 |
+
[
|
| 38 |
+
2,
|
| 39 |
+
6
|
| 40 |
+
],
|
| 41 |
+
[
|
| 42 |
+
3,
|
| 43 |
+
12
|
| 44 |
+
]
|
| 45 |
+
]
|
| 46 |
+
}
|
| 47 |
+
},
|
| 48 |
+
"train": {
|
| 49 |
+
"criterions": [
|
| 50 |
+
"feature",
|
| 51 |
+
"discriminator",
|
| 52 |
+
"generator",
|
| 53 |
+
"mel",
|
| 54 |
+
]
|
| 55 |
+
},
|
| 56 |
+
"inference": {
|
| 57 |
+
"batch_size": 1,
|
| 58 |
+
}
|
| 59 |
+
}
|
egs/vocoder/gan/hifigan/run.sh
ADDED
|
@@ -0,0 +1,143 @@
|
| 1 |
+
# Copyright (c) 2023 Amphion.
|
| 2 |
+
#
|
| 3 |
+
# This source code is licensed under the MIT license found in the
|
| 4 |
+
# LICENSE file in the root directory of this source tree.
|
| 5 |
+
|
| 6 |
+
######## Build Experiment Environment ###########
|
| 7 |
+
exp_dir=$(cd `dirname $0`; pwd)
|
| 8 |
+
work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))
|
| 9 |
+
|
| 10 |
+
export WORK_DIR=$work_dir
|
| 11 |
+
export PYTHONPATH=$work_dir
|
| 12 |
+
export PYTHONIOENCODING=UTF-8
|
| 13 |
+
|
| 14 |
+
######## Parse the Given Parameters from the Command ###########
|
| 15 |
+
options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,resume:,checkpoint:,resume_type:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
|
| 16 |
+
eval set -- "$options"
|
| 17 |
+
|
| 18 |
+
while true; do
|
| 19 |
+
case $1 in
|
| 20 |
+
# Experimental Configuration File
|
| 21 |
+
-c | --config) shift; exp_config=$1 ; shift ;;
|
| 22 |
+
# Experimental Name
|
| 23 |
+
-n | --name) shift; exp_name=$1 ; shift ;;
|
| 24 |
+
# Running Stage
|
| 25 |
+
-s | --stage) shift; running_stage=$1 ; shift ;;
|
| 26 |
+
# Visible GPU machines. The default value is "0".
|
| 27 |
+
--gpu) shift; gpu=$1 ; shift ;;
|
| 28 |
+
|
| 29 |
+
# [Only for Training] Resume configuration
|
| 30 |
+
--resume) shift; resume=$1 ; shift ;;
|
| 31 |
+
# [Only for Training] The specific checkpoint path that you want to resume from.
|
| 32 |
+
--checkpoint) shift; checkpoint=$1 ; shift ;;
|
| 33 |
+
# [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
|
| 34 |
+
--resume_type) shift; resume_type=$1 ; shift ;;
|
| 35 |
+
|
| 36 |
+
# [Only for Inference] The inference mode
|
| 37 |
+
--infer_mode) shift; infer_mode=$1 ; shift ;;
|
| 38 |
+
# [Only for Inference] The inferenced datasets
|
| 39 |
+
--infer_datasets) shift; infer_datasets=$1 ; shift ;;
|
| 40 |
+
# [Only for Inference] The feature dir for inference
|
| 41 |
+
--infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
|
| 42 |
+
# [Only for Inference] The audio dir for inference
|
| 43 |
+
--infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
|
| 44 |
+
# [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
|
| 45 |
+
--infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
|
| 46 |
+
# [Only for Inference] The output dir to save inferred audios. Its default value is "$expt_dir/result"
|
| 47 |
+
--infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
|
| 48 |
+
|
| 49 |
+
--) shift ; break ;;
|
| 50 |
+
*) echo "Invalid option: $1" exit 1 ;;
|
| 51 |
+
esac
|
| 52 |
+
done
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
### Value check ###
|
| 56 |
+
if [ -z "$running_stage" ]; then
|
| 57 |
+
echo "[Error] Please specify the running stage"
|
| 58 |
+
exit 1
|
| 59 |
+
fi
|
| 60 |
+
|
| 61 |
+
if [ -z "$exp_config" ]; then
|
| 62 |
+
exp_config="${exp_dir}"/exp_config.json
|
| 63 |
+
fi
|
| 64 |
+
echo "Exprimental Configuration File: $exp_config"
|
| 65 |
+
|
| 66 |
+
if [ -z "$gpu" ]; then
|
| 67 |
+
gpu="0"
|
| 68 |
+
fi
|
| 69 |
+
|
| 70 |
+
######## Features Extraction ###########
|
| 71 |
+
if [ $running_stage -eq 1 ]; then
|
| 72 |
+
CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
|
| 73 |
+
--config $exp_config \
|
| 74 |
+
--num_workers 8
|
| 75 |
+
fi
|
| 76 |
+
|
| 77 |
+
######## Training ###########
|
| 78 |
+
if [ $running_stage -eq 2 ]; then
|
| 79 |
+
if [ -z "$exp_name" ]; then
|
| 80 |
+
echo "[Error] Please specify the experiments name"
|
| 81 |
+
exit 1
|
| 82 |
+
fi
|
| 83 |
+
echo "Exprimental Name: $exp_name"
|
| 84 |
+
|
| 85 |
+
if [ "$resume" = true ]; then
|
| 86 |
+
echo "Automatically resume from the experimental dir..."
|
| 87 |
+
CUDA_VISIBLE_DEVICES="$gpu" accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 88 |
+
--config "$exp_config" \
|
| 89 |
+
--exp_name "$exp_name" \
|
| 90 |
+
--log_level info \
|
| 91 |
+
--resume
|
| 92 |
+
else
|
| 93 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 94 |
+
--config "$exp_config" \
|
| 95 |
+
--exp_name "$exp_name" \
|
| 96 |
+
--log_level info \
|
| 97 |
+
--checkpoint "$checkpoint" \
|
| 98 |
+
--resume_type "$resume_type"
|
| 99 |
+
fi
|
| 100 |
+
fi
|
| 101 |
+
|
| 102 |
+
######## Inference/Conversion ###########
|
| 103 |
+
if [ $running_stage -eq 3 ]; then
|
| 104 |
+
if [ -z "$infer_expt_dir" ]; then
|
| 105 |
+
echo "[Error] Please specify the experimental directionary. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
|
| 106 |
+
exit 1
|
| 107 |
+
fi
|
| 108 |
+
|
| 109 |
+
if [ -z "$infer_output_dir" ]; then
|
| 110 |
+
infer_output_dir="$infer_expt_dir/result"
|
| 111 |
+
fi
|
| 112 |
+
|
| 113 |
+
if [ $infer_mode = "infer_from_dataset" ]; then
|
| 114 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 115 |
+
--config $exp_config \
|
| 116 |
+
--infer_mode $infer_mode \
|
| 117 |
+
--infer_datasets $infer_datasets \
|
| 118 |
+
--vocoder_dir $infer_expt_dir \
|
| 119 |
+
--output_dir $infer_output_dir \
|
| 120 |
+
--log_level debug
|
| 121 |
+
fi
|
| 122 |
+
|
| 123 |
+
if [ $infer_mode = "infer_from_feature" ]; then
|
| 124 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 125 |
+
--config $exp_config \
|
| 126 |
+
--infer_mode $infer_mode \
|
| 127 |
+
--feature_folder $infer_feature_dir \
|
| 128 |
+
--vocoder_dir $infer_expt_dir \
|
| 129 |
+
--output_dir $infer_output_dir \
|
| 130 |
+
--log_level debug
|
| 131 |
+
fi
|
| 132 |
+
|
| 133 |
+
if [ $infer_mode = "infer_from_audio" ]; then
|
| 134 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 135 |
+
--config $exp_config \
|
| 136 |
+
--infer_mode $infer_mode \
|
| 137 |
+
--audio_folder $infer_audio_dir \
|
| 138 |
+
--vocoder_dir $infer_expt_dir \
|
| 139 |
+
--output_dir $infer_output_dir \
|
| 140 |
+
--log_level debug
|
| 141 |
+
fi
|
| 142 |
+
|
| 143 |
+
fi
|
egs/vocoder/gan/melgan/exp_config.json
ADDED
|
@@ -0,0 +1,34 @@
|
| 1 |
+
{
|
| 2 |
+
"base_config": "egs/vocoder/gan/exp_config_base.json",
|
| 3 |
+
"preprocess": {
|
| 4 |
+
// acoustic features
|
| 5 |
+
"extract_mel": true,
|
| 6 |
+
"extract_audio": true,
|
| 7 |
+
|
| 8 |
+
// Features used for model training
|
| 9 |
+
"use_mel": true,
|
| 10 |
+
"use_audio": true
|
| 11 |
+
},
|
| 12 |
+
"model": {
|
| 13 |
+
"generator": "melgan",
|
| 14 |
+
"melgan": {
|
| 15 |
+
"ratios": [8, 8, 2, 2],
|
| 16 |
+
"ngf": 32,
|
| 17 |
+
"n_residual_layers": 3,
|
| 18 |
+
"num_D": 3,
|
| 19 |
+
"ndf": 16,
|
| 20 |
+
"n_layers": 4,
|
| 21 |
+
"downsampling_factor": 4
|
| 22 |
+
},
|
| 23 |
+
},
|
| 24 |
+
"train": {
|
| 25 |
+
"criterions": [
|
| 26 |
+
"feature",
|
| 27 |
+
"discriminator",
|
| 28 |
+
"generator",
|
| 29 |
+
]
|
| 30 |
+
},
|
| 31 |
+
"inference": {
|
| 32 |
+
"batch_size": 1,
|
| 33 |
+
}
|
| 34 |
+
}
|
egs/vocoder/gan/melgan/run.sh
ADDED
|
@@ -0,0 +1,143 @@
|
| 1 |
+
# Copyright (c) 2023 Amphion.
|
| 2 |
+
#
|
| 3 |
+
# This source code is licensed under the MIT license found in the
|
| 4 |
+
# LICENSE file in the root directory of this source tree.
|
| 5 |
+
|
| 6 |
+
######## Build Experiment Environment ###########
|
| 7 |
+
exp_dir=$(cd `dirname $0`; pwd)
|
| 8 |
+
work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))
|
| 9 |
+
|
| 10 |
+
export WORK_DIR=$work_dir
|
| 11 |
+
export PYTHONPATH=$work_dir
|
| 12 |
+
export PYTHONIOENCODING=UTF-8
|
| 13 |
+
|
| 14 |
+
######## Parse the Given Parameters from the Command ###########
|
| 15 |
+
options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,resume:,checkpoint:,resume_type:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
|
| 16 |
+
eval set -- "$options"
|
| 17 |
+
|
| 18 |
+
while true; do
|
| 19 |
+
case $1 in
|
| 20 |
+
# Experimental Configuration File
|
| 21 |
+
-c | --config) shift; exp_config=$1 ; shift ;;
|
| 22 |
+
# Experimental Name
|
| 23 |
+
-n | --name) shift; exp_name=$1 ; shift ;;
|
| 24 |
+
# Running Stage
|
| 25 |
+
-s | --stage) shift; running_stage=$1 ; shift ;;
|
| 26 |
+
# Visible GPU machines. The default value is "0".
|
| 27 |
+
--gpu) shift; gpu=$1 ; shift ;;
|
| 28 |
+
|
| 29 |
+
# [Only for Training] Resume configuration
|
| 30 |
+
--resume) shift; resume=$1 ; shift ;;
|
| 31 |
+
# [Only for Training] The specific checkpoint path that you want to resume from.
|
| 32 |
+
--checkpoint) shift; checkpoint=$1 ; shift ;;
|
| 33 |
+
# [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
|
| 34 |
+
--resume_type) shift; resume_type=$1 ; shift ;;
|
| 35 |
+
|
| 36 |
+
# [Only for Inference] The inference mode
|
| 37 |
+
--infer_mode) shift; infer_mode=$1 ; shift ;;
|
| 38 |
+
# [Only for Inference] The inferenced datasets
|
| 39 |
+
--infer_datasets) shift; infer_datasets=$1 ; shift ;;
|
| 40 |
+
# [Only for Inference] The feature dir for inference
|
| 41 |
+
--infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
|
| 42 |
+
# [Only for Inference] The audio dir for inference
|
| 43 |
+
--infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
|
| 44 |
+
# [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
|
| 45 |
+
--infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
|
| 46 |
+
# [Only for Inference] The output dir to save inferred audios. Its default value is "$expt_dir/result"
|
| 47 |
+
--infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
|
| 48 |
+
|
| 49 |
+
--) shift ; break ;;
|
| 50 |
+
*) echo "Invalid option: $1" exit 1 ;;
|
| 51 |
+
esac
|
| 52 |
+
done
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
### Value check ###
|
| 56 |
+
if [ -z "$running_stage" ]; then
|
| 57 |
+
echo "[Error] Please specify the running stage"
|
| 58 |
+
exit 1
|
| 59 |
+
fi
|
| 60 |
+
|
| 61 |
+
if [ -z "$exp_config" ]; then
|
| 62 |
+
exp_config="${exp_dir}"/exp_config.json
|
| 63 |
+
fi
|
| 64 |
+
echo "Exprimental Configuration File: $exp_config"
|
| 65 |
+
|
| 66 |
+
if [ -z "$gpu" ]; then
|
| 67 |
+
gpu="0"
|
| 68 |
+
fi
|
| 69 |
+
|
| 70 |
+
######## Features Extraction ###########
|
| 71 |
+
if [ $running_stage -eq 1 ]; then
|
| 72 |
+
CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
|
| 73 |
+
--config $exp_config \
|
| 74 |
+
--num_workers 8
|
| 75 |
+
fi
|
| 76 |
+
|
| 77 |
+
######## Training ###########
|
| 78 |
+
if [ $running_stage -eq 2 ]; then
|
| 79 |
+
if [ -z "$exp_name" ]; then
|
| 80 |
+
echo "[Error] Please specify the experiments name"
|
| 81 |
+
exit 1
|
| 82 |
+
fi
|
| 83 |
+
echo "Exprimental Name: $exp_name"
|
| 84 |
+
|
| 85 |
+
if [ "$resume" = true ]; then
|
| 86 |
+
echo "Automatically resume from the experimental dir..."
|
| 87 |
+
CUDA_VISIBLE_DEVICES="$gpu" accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 88 |
+
--config "$exp_config" \
|
| 89 |
+
--exp_name "$exp_name" \
|
| 90 |
+
--log_level info \
|
| 91 |
+
--resume
|
| 92 |
+
else
|
| 93 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 94 |
+
--config "$exp_config" \
|
| 95 |
+
--exp_name "$exp_name" \
|
| 96 |
+
--log_level info \
|
| 97 |
+
--checkpoint "$checkpoint" \
|
| 98 |
+
--resume_type "$resume_type"
|
| 99 |
+
fi
|
| 100 |
+
fi
|
| 101 |
+
|
| 102 |
+
######## Inference/Conversion ###########
|
| 103 |
+
if [ $running_stage -eq 3 ]; then
|
| 104 |
+
if [ -z "$infer_expt_dir" ]; then
|
| 105 |
+
echo "[Error] Please specify the experimental directionary. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
|
| 106 |
+
exit 1
|
| 107 |
+
fi
|
| 108 |
+
|
| 109 |
+
if [ -z "$infer_output_dir" ]; then
|
| 110 |
+
infer_output_dir="$infer_expt_dir/result"
|
| 111 |
+
fi
|
| 112 |
+
|
| 113 |
+
if [ $infer_mode = "infer_from_dataset" ]; then
|
| 114 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 115 |
+
--config $exp_config \
|
| 116 |
+
--infer_mode $infer_mode \
|
| 117 |
+
--infer_datasets $infer_datasets \
|
| 118 |
+
--vocoder_dir $infer_expt_dir \
|
| 119 |
+
--output_dir $infer_output_dir \
|
| 120 |
+
--log_level debug
|
| 121 |
+
fi
|
| 122 |
+
|
| 123 |
+
if [ $infer_mode = "infer_from_feature" ]; then
|
| 124 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 125 |
+
--config $exp_config \
|
| 126 |
+
--infer_mode $infer_mode \
|
| 127 |
+
--feature_folder $infer_feature_dir \
|
| 128 |
+
--vocoder_dir $infer_expt_dir \
|
| 129 |
+
--output_dir $infer_output_dir \
|
| 130 |
+
--log_level debug
|
| 131 |
+
fi
|
| 132 |
+
|
| 133 |
+
if [ $infer_mode = "infer_from_audio" ]; then
|
| 134 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 135 |
+
--config $exp_config \
|
| 136 |
+
--infer_mode $infer_mode \
|
| 137 |
+
--audio_folder $infer_audio_dir \
|
| 138 |
+
--vocoder_dir $infer_expt_dir \
|
| 139 |
+
--output_dir $infer_output_dir \
|
| 140 |
+
--log_level debug
|
| 141 |
+
fi
|
| 142 |
+
|
| 143 |
+
fi
|
egs/vocoder/gan/nsfhifigan/exp_config.json
ADDED
|
@@ -0,0 +1,83 @@
|
| 1 |
+
{
|
| 2 |
+
"base_config": "egs/vocoder/gan/exp_config_base.json",
|
| 3 |
+
"preprocess": {
|
| 4 |
+
// acoustic features
|
| 5 |
+
"extract_mel": true,
|
| 6 |
+
"extract_audio": true,
|
| 7 |
+
"extract_pitch": true,
|
| 8 |
+
|
| 9 |
+
// Features used for model training
|
| 10 |
+
"use_mel": true,
|
| 11 |
+
"use_audio": true,
|
| 12 |
+
"use_frame_pitch": true
|
| 13 |
+
},
|
| 14 |
+
"model": {
|
| 15 |
+
"generator": "nsfhifigan",
|
| 16 |
+
"nsfhifigan": {
|
| 17 |
+
"resblock": "1",
|
| 18 |
+
"harmonic_num": 8,
|
| 19 |
+
"upsample_rates": [
|
| 20 |
+
8,
|
| 21 |
+
4,
|
| 22 |
+
2,
|
| 23 |
+
2,
|
| 24 |
+
2
|
| 25 |
+
],
|
| 26 |
+
"upsample_kernel_sizes": [
|
| 27 |
+
16,
|
| 28 |
+
8,
|
| 29 |
+
4,
|
| 30 |
+
4,
|
| 31 |
+
4
|
| 32 |
+
],
|
| 33 |
+
"upsample_initial_channel": 768,
|
| 34 |
+
"resblock_kernel_sizes": [
|
| 35 |
+
3,
|
| 36 |
+
7,
|
| 37 |
+
11
|
| 38 |
+
],
|
| 39 |
+
"resblock_dilation_sizes": [
|
| 40 |
+
[
|
| 41 |
+
1,
|
| 42 |
+
3,
|
| 43 |
+
5
|
| 44 |
+
],
|
| 45 |
+
[
|
| 46 |
+
1,
|
| 47 |
+
3,
|
| 48 |
+
5
|
| 49 |
+
],
|
| 50 |
+
[
|
| 51 |
+
1,
|
| 52 |
+
3,
|
| 53 |
+
5
|
| 54 |
+
]
|
| 55 |
+
]
|
| 56 |
+
},
|
| 57 |
+
"mpd": {
|
| 58 |
+
"mpd_reshapes": [
|
| 59 |
+
2,
|
| 60 |
+
3,
|
| 61 |
+
5,
|
| 62 |
+
7,
|
| 63 |
+
11,
|
| 64 |
+
17,
|
| 65 |
+
23,
|
| 66 |
+
37
|
| 67 |
+
],
|
| 68 |
+
"use_spectral_norm": false,
|
| 69 |
+
"discriminator_channel_multi": 1
|
| 70 |
+
}
|
| 71 |
+
},
|
| 72 |
+
"train": {
|
| 73 |
+
"criterions": [
|
| 74 |
+
"feature",
|
| 75 |
+
"discriminator",
|
| 76 |
+
"generator",
|
| 77 |
+
"mel",
|
| 78 |
+
]
|
| 79 |
+
},
|
| 80 |
+
"inference": {
|
| 81 |
+
"batch_size": 1,
|
| 82 |
+
}
|
| 83 |
+
}
|
egs/vocoder/gan/nsfhifigan/run.sh
ADDED
|
@@ -0,0 +1,143 @@
|
| 1 |
+
# Copyright (c) 2023 Amphion.
|
| 2 |
+
#
|
| 3 |
+
# This source code is licensed under the MIT license found in the
|
| 4 |
+
# LICENSE file in the root directory of this source tree.
|
| 5 |
+
|
| 6 |
+
######## Build Experiment Environment ###########
|
| 7 |
+
exp_dir=$(cd `dirname $0`; pwd)
|
| 8 |
+
work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))
|
| 9 |
+
|
| 10 |
+
export WORK_DIR=$work_dir
|
| 11 |
+
export PYTHONPATH=$work_dir
|
| 12 |
+
export PYTHONIOENCODING=UTF-8
|
| 13 |
+
|
| 14 |
+
######## Parse the Given Parameters from the Command ###########
|
| 15 |
+
options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,resume:,checkpoint:,resume_type:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
|
| 16 |
+
eval set -- "$options"
|
| 17 |
+
|
| 18 |
+
while true; do
|
| 19 |
+
case $1 in
|
| 20 |
+
# Experimental Configuration File
|
| 21 |
+
-c | --config) shift; exp_config=$1 ; shift ;;
|
| 22 |
+
# Experimental Name
|
| 23 |
+
-n | --name) shift; exp_name=$1 ; shift ;;
|
| 24 |
+
# Running Stage
|
| 25 |
+
-s | --stage) shift; running_stage=$1 ; shift ;;
|
| 26 |
+
# Visible GPU machines. The default value is "0".
|
| 27 |
+
--gpu) shift; gpu=$1 ; shift ;;
|
| 28 |
+
|
| 29 |
+
# [Only for Training] Resume configuration
|
| 30 |
+
--resume) shift; resume=$1 ; shift ;;
|
| 31 |
+
# [Only for Training] The specific checkpoint path that you want to resume from.
|
| 32 |
+
--checkpoint) shift; checkpoint=$1 ; shift ;;
|
| 33 |
+
# [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
|
| 34 |
+
--resume_type) shift; resume_type=$1 ; shift ;;
|
| 35 |
+
|
| 36 |
+
# [Only for Inference] The inference mode
|
| 37 |
+
--infer_mode) shift; infer_mode=$1 ; shift ;;
|
| 38 |
+
# [Only for Inference] The inferenced datasets
|
| 39 |
+
--infer_datasets) shift; infer_datasets=$1 ; shift ;;
|
| 40 |
+
# [Only for Inference] The feature dir for inference
|
| 41 |
+
--infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
|
| 42 |
+
# [Only for Inference] The audio dir for inference
|
| 43 |
+
--infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
|
| 44 |
+
# [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
|
| 45 |
+
--infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
|
| 46 |
+
# [Only for Inference] The output dir to save inferred audios. Its default value is "$expt_dir/result"
|
| 47 |
+
--infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
|
| 48 |
+
|
| 49 |
+
--) shift ; break ;;
|
| 50 |
+
*) echo "Invalid option: $1" exit 1 ;;
|
| 51 |
+
esac
|
| 52 |
+
done
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
### Value check ###
|
| 56 |
+
if [ -z "$running_stage" ]; then
|
| 57 |
+
echo "[Error] Please specify the running stage"
|
| 58 |
+
exit 1
|
| 59 |
+
fi
|
| 60 |
+
|
| 61 |
+
if [ -z "$exp_config" ]; then
|
| 62 |
+
exp_config="${exp_dir}"/exp_config.json
|
| 63 |
+
fi
|
| 64 |
+
echo "Exprimental Configuration File: $exp_config"
|
| 65 |
+
|
| 66 |
+
if [ -z "$gpu" ]; then
|
| 67 |
+
gpu="0"
|
| 68 |
+
fi
|
| 69 |
+
|
| 70 |
+
######## Features Extraction ###########
|
| 71 |
+
if [ $running_stage -eq 1 ]; then
|
| 72 |
+
CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
|
| 73 |
+
--config $exp_config \
|
| 74 |
+
--num_workers 8
|
| 75 |
+
fi
|
| 76 |
+
|
| 77 |
+
######## Training ###########
|
| 78 |
+
if [ $running_stage -eq 2 ]; then
|
| 79 |
+
if [ -z "$exp_name" ]; then
|
| 80 |
+
echo "[Error] Please specify the experiments name"
|
| 81 |
+
exit 1
|
| 82 |
+
fi
|
| 83 |
+
echo "Exprimental Name: $exp_name"
|
| 84 |
+
|
| 85 |
+
if [ "$resume" = true ]; then
|
| 86 |
+
echo "Automatically resume from the experimental dir..."
|
| 87 |
+
CUDA_VISIBLE_DEVICES="$gpu" accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 88 |
+
--config "$exp_config" \
|
| 89 |
+
--exp_name "$exp_name" \
|
| 90 |
+
--log_level info \
|
| 91 |
+
--resume
|
| 92 |
+
else
|
| 93 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/vocoder/train.py \
|
| 94 |
+
--config "$exp_config" \
|
| 95 |
+
--exp_name "$exp_name" \
|
| 96 |
+
--log_level info \
|
| 97 |
+
--checkpoint "$checkpoint" \
|
| 98 |
+
--resume_type "$resume_type"
|
| 99 |
+
fi
|
| 100 |
+
fi
|
| 101 |
+
|
| 102 |
+
######## Inference/Conversion ###########
|
| 103 |
+
if [ $running_stage -eq 3 ]; then
|
| 104 |
+
if [ -z "$infer_expt_dir" ]; then
|
| 105 |
+
echo "[Error] Please specify the experimental directionary. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
|
| 106 |
+
exit 1
|
| 107 |
+
fi
|
| 108 |
+
|
| 109 |
+
if [ -z "$infer_output_dir" ]; then
|
| 110 |
+
infer_output_dir="$infer_expt_dir/result"
|
| 111 |
+
fi
|
| 112 |
+
|
| 113 |
+
if [ $infer_mode = "infer_from_dataset" ]; then
|
| 114 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 115 |
+
--config $exp_config \
|
| 116 |
+
--infer_mode $infer_mode \
|
| 117 |
+
--infer_datasets $infer_datasets \
|
| 118 |
+
--vocoder_dir $infer_expt_dir \
|
| 119 |
+
--output_dir $infer_output_dir \
|
| 120 |
+
--log_level debug
|
| 121 |
+
fi
|
| 122 |
+
|
| 123 |
+
if [ $infer_mode = "infer_from_feature" ]; then
|
| 124 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 125 |
+
--config $exp_config \
|
| 126 |
+
--infer_mode $infer_mode \
|
| 127 |
+
--feature_folder $infer_feature_dir \
|
| 128 |
+
--vocoder_dir $infer_expt_dir \
|
| 129 |
+
--output_dir $infer_output_dir \
|
| 130 |
+
--log_level debug
|
| 131 |
+
fi
|
| 132 |
+
|
| 133 |
+
if [ $infer_mode = "infer_from_audio" ]; then
|
| 134 |
+
CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
|
| 135 |
+
--config $exp_config \
|
| 136 |
+
--infer_mode $infer_mode \
|
| 137 |
+
--audio_folder $infer_audio_dir \
|
| 138 |
+
--vocoder_dir $infer_expt_dir \
|
| 139 |
+
--output_dir $infer_output_dir \
|
| 140 |
+
--log_level debug
|
| 141 |
+
fi
|
| 142 |
+
|
| 143 |
+
fi
|
egs/vocoder/gan/tfr_enhanced_hifigan/README.md
ADDED
|
@@ -0,0 +1,185 @@
|
| 1 |
+
# Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
|
| 2 |
+
|
| 3 |
+
[](https://arxiv.org/abs/2311.14957)
|
| 4 |
+
[](https://vocodexelysium.github.io/MS-SB-CQTD/)
|
| 5 |
+
|
| 6 |
+
<br>
|
| 7 |
+
<div align="center">
|
| 8 |
+
<img src="../../../../imgs/vocoder/gan/MSSBCQTD.png" width="80%">
|
| 9 |
+
</div>
|
| 10 |
+
<br>
|
| 11 |
+
|
| 12 |
+
This is the official implementation of the paper "[Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder](https://arxiv.org/abs/2311.14957)". In this recipe, we will illustrate how to train a high-quality HiFi-GAN on LibriTTS, VCTK, and LJSpeech by utilizing multiple time-frequency-representation-based discriminators.
|
| 13 |
+
|
| 14 |
+
There are four stages in total:
|
| 15 |
+
|
| 16 |
+
1. Data preparation
|
| 17 |
+
2. Feature extraction
|
| 18 |
+
3. Training
|
| 19 |
+
4. Inference
|
| 20 |
+
|
| 21 |
+
> **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
|
| 22 |
+
> ```bash
|
| 23 |
+
> cd Amphion
|
| 24 |
+
> ```
|
| 25 |
+
|
| 26 |
+
## 1. Data Preparation
|
| 27 |
+
|
| 28 |
+
### Dataset Download
|
| 29 |
+
|
| 30 |
+
By default, we use three datasets for training: LibriTTS, VCTK, and LJSpeech. How to download them is detailed [here](../../../datasets/README.md).
|
| 31 |
+
|
| 32 |
+
### Configuration
|
| 33 |
+
|
| 34 |
+
Specify the dataset path in `exp_config.json`. Note that you can change the `dataset` list to use your preferred datasets.
|
| 35 |
+
|
| 36 |
+
```json
|
| 37 |
+
"dataset": [
|
| 38 |
+
"ljspeech",
|
| 39 |
+
"vctk",
|
| 40 |
+
"libritts",
|
| 41 |
+
],
|
| 42 |
+
"dataset_path": {
|
| 43 |
+
// TODO: Fill in your dataset path
|
| 44 |
+
"ljspeech": "[LJSpeech dataset path]",
|
| 45 |
+
"vctk": "[VCTK dataset path]",
|
| 46 |
+
"libritts": "[LibriTTS dataset path]",
|
| 47 |
+
},
|
| 48 |
+
```
|
| 49 |
+
|
| 50 |
+
## 2. Features Extraction
|
| 51 |
+
|
| 52 |
+
For HiFiGAN, only the Mel-Spectrogram and the Output Audio are needed for training.
|
| 53 |
+
|
| 54 |
+
### Configuration
|
| 55 |
+
|
| 56 |
+
Specify the dataset path and the output path for saving the processed data and the training model in `exp_config.json`:
|
| 57 |
+
|
| 58 |
+
```json
|
| 59 |
+
// TODO: Fill in the output log path. The default value is "Amphion/ckpts/vocoder"
|
| 60 |
+
"log_dir": "ckpts/vocoder",
|
| 61 |
+
"preprocess": {
|
| 62 |
+
// TODO: Fill in the output data path. The default value is "Amphion/data"
|
| 63 |
+
"processed_dir": "data",
|
| 64 |
+
...
|
| 65 |
+
},
|
| 66 |
+
```
|
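For orientation, here is a minimal sketch of the kind of feature this configuration implies: a 100-bin mel spectrogram extracted at 24 kHz. This is not the recipe's own preprocessing code (that is handled by `bins/vocoder/preprocess.py` in stage 1); the use of `librosa` and the `n_fft`/`hop_length` values are illustrative assumptions.

```python
# Illustrative sketch only -- the real preprocessing is done by
# bins/vocoder/preprocess.py. n_fft and hop_length are assumed values.
import librosa
import numpy as np


def extract_log_mel(wav_path, sr=24000, n_fft=1024, hop_length=256, n_mels=100):
    # Load and resample the audio to the configured sample rate
    y, _ = librosa.load(wav_path, sr=sr)
    # Compute a mel spectrogram with the configured number of mel bins
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Log-compress; the result has shape [n_mels, n_frames]
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```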
| 67 |
+
|
| 68 |
+
### Run
|
| 69 |
+
|
| 70 |
+
Run the `run.sh` as the preprocessing stage (set `--stage 1`).
|
| 71 |
+
|
| 72 |
+
```bash
|
| 73 |
+
sh egs/vocoder/gan/tfr_enhanced_hifigan/run.sh --stage 1
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set to `"0"` by default. You can change it when running `run.sh` by specifying, for example, `--gpu "1"`.
|
| 77 |
+
|
| 78 |
+
## 3. Training
|
| 79 |
+
|
| 80 |
+
### Configuration
|
| 81 |
+
|
| 82 |
+
We provide the default hyperparameters in `exp_config.json`. They can work on a single NVIDIA 24 GB GPU. You can adjust them based on your GPU machines.
|
| 83 |
+
|
| 84 |
+
```json
|
| 85 |
+
"train": {
|
| 86 |
+
"batch_size": 32,
|
| 87 |
+
...
|
| 88 |
+
}
|
| 89 |
+
```
|
| 90 |
+
|
| 91 |
+
### Run
|
| 92 |
+
|
| 93 |
+
Run the `run.sh` as the training stage (set `--stage 2`). Specify an experiment name to run the following command. The tensorboard logs and checkpoints will be saved in `Amphion/ckpts/vocoder/[YourExptName]`.
|
| 94 |
+
|
| 95 |
+
```bash
|
| 96 |
+
sh egs/vocoder/gan/tfr_enhanced_hifigan/run.sh --stage 2 --name [YourExptName]
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set to `"0"` by default. You can change it when running `run.sh` by specifying, for example, `--gpu "0,1,2,3"`.
|
| 100 |
+
|
| 101 |
+
## 4. Inference
|
| 102 |
+
|
| 103 |
+
### Pretrained Vocoder Download
|
| 104 |
+
|
| 105 |
+
We trained a HiFiGAN checkpoint with around 685 hours of speech data. The final pretrained checkpoint is released [here](../../../../pretrained/hifigan/README.md).
|
| 106 |
+
|
| 107 |
+
### Run
|
| 108 |
+
|
| 109 |
+
Run the `run.sh` as the inference stage (set `--stage 3`). We provide three different inference modes, including `infer_from_dataset`, `infer_from_feature`, and `infer_from_audio`.
|
| 110 |
+
|
| 111 |
+
```bash
|
| 112 |
+
sh egs/vocoder/gan/tfr_enhanced_hifigan/run.sh --stage 3 \
|
| 113 |
+
--infer_mode [Your chosen inference mode] \
|
| 114 |
+
--infer_datasets [Datasets you want to run inference on, needed when infer_from_dataset] \
|
| 115 |
+
--infer_feature_dir [Your path to your predicted acoustic features, needed when infer_from_feature] \
|
| 116 |
+
--infer_audio_dir [Your path to your audio files, needed when infer_from_audio] \
|
| 117 |
+
--infer_expt_dir Amphion/ckpts/vocoder/[YourExptName] \
|
| 118 |
+
--infer_output_dir Amphion/ckpts/vocoder/[YourExptName]/result \
|
| 119 |
+
```
|
| 120 |
+
|
| 121 |
+
#### a. Inference from Dataset
|
| 122 |
+
|
| 123 |
+
Run the `run.sh` with the specified datasets; here is an example.
|
| 124 |
+
|
| 125 |
+
```bash
|
| 126 |
+
sh egs/vocoder/gan/tfr_enhanced_hifigan/run.sh --stage 3 \
|
| 127 |
+
--infer_mode infer_from_dataset \
|
| 128 |
+
--infer_datasets "libritts vctk ljspeech" \
|
| 129 |
+
--infer_expt_dir Amphion/ckpts/vocoder/[YourExptName] \
|
| 130 |
+
--infer_output_dir Amphion/ckpts/vocoder/[YourExptName]/result \
|
| 131 |
+
```
|
| 132 |
+
|
| 133 |
+
#### b. Inference from Features
|
| 134 |
+
|
| 135 |
+
If you want to run inference from your generated acoustic features, you should first organize them into the following structure:
|
| 136 |
+
|
| 137 |
+
```plaintext
|
| 138 |
+
┣ {infer_feature_dir}
|
| 139 |
+
┃ ┣ mels
|
| 140 |
+
┃ ┃ ┣ sample1.npy
|
| 141 |
+
┃ ┃ ┣ sample2.npy
|
| 142 |
+
```
|
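As a minimal sketch (assuming your acoustic model yields mel spectrograms as NumPy arrays; the helper name and file naming below are only for illustration), each predicted feature can be written into this layout like so:

```python
# Illustrative sketch only: write each predicted mel spectrogram into
# {infer_feature_dir}/mels/ as a .npy file, matching the layout above.
import os
import numpy as np


def save_mel(mel: np.ndarray, infer_feature_dir: str, utt_id: str) -> None:
    mel_dir = os.path.join(infer_feature_dir, "mels")
    os.makedirs(mel_dir, exist_ok=True)
    np.save(os.path.join(mel_dir, f"{utt_id}.npy"), mel)
```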
| 143 |
+
|
| 144 |
+
Then run the `run.sh` with the specified folder path; here is an example.
|
| 145 |
+
|
| 146 |
+
```bash
|
| 147 |
+
sh egs/vocoder/gan/tfr_enhanced_hifigan/run.sh --stage 3 \
|
| 148 |
+
--infer_mode infer_from_feature \
|
| 149 |
+
--infer_feature_dir [Your path to your predicted acoustic features] \
|
| 150 |
+
--infer_expt_dir Amphion/ckpts/vocoder/[YourExptName] \
|
| 151 |
+
--infer_output_dir Amphion/ckpts/vocoder/[YourExptName]/result \
|
| 152 |
+
```
|
| 153 |
+
|
| 154 |
+
#### c. Inference from Audios
|
| 155 |
+
|
| 156 |
+
If you want to run inference from audio files for quick analysis-synthesis, you should first organize them into the following structure:
|
| 157 |
+
|
| 158 |
+
```plaintext
|
| 159 |
+
┣ audios
|
| 160 |
+
┃ ┣ sample1.wav
|
| 161 |
+
┃ ┣ sample2.wav
|
| 162 |
+
```
|
| 163 |
+
|
| 164 |
+
Then run the `run.sh` with the specified folder path; here is an example.
|
| 165 |
+
|
| 166 |
+
```bash
|
| 167 |
+
sh egs/vocoder/gan/tfr_enhanced_hifigan/run.sh --stage 3 \
|
| 168 |
+
--infer_mode infer_from_audio \
|
| 169 |
+
--infer_audio_dir [Your path to your audio files] \
|
| 170 |
+
--infer_expt_dir Amphion/ckpts/vocoder/[YourExptName] \
|
| 171 |
+
--infer_output_dir Amphion/ckpts/vocoder/[YourExptName]/result \
|
| 172 |
+
```
|
| 173 |
+
|
| 174 |
+
## Citations
|
| 175 |
+
|
| 176 |
+
```bibtex
|
| 177 |
+
@misc{gu2023cqt,
|
| 178 |
+
title={Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder},
|
| 179 |
+
author={Yicheng Gu and Xueyao Zhang and Liumeng Xue and Zhizheng Wu},
|
| 180 |
+
year={2023},
|
| 181 |
+
eprint={2311.14957},
|
| 182 |
+
archivePrefix={arXiv},
|
| 183 |
+
primaryClass={cs.SD}
|
| 184 |
+
}
|
| 185 |
+
```
|
egs/vocoder/gan/tfr_enhanced_hifigan/exp_config.json
ADDED
@@ -0,0 +1,118 @@
{
    "base_config": "egs/vocoder/gan/exp_config_base.json",
    "model_type": "GANVocoder",
    "dataset": [
        "ljspeech",
        "vctk",
        "libritts",
    ],
    "dataset_path": {
        // TODO: Fill in your dataset path
        "ljspeech": "[dataset path]",
        "vctk": "[dataset path]",
        "libritts": "[dataset path]",
    },
    // TODO: Fill in the output log path. The default value is "Amphion/ckpts/vocoder"
    "log_dir": "ckpts/vocoder",
    "preprocess": {
        // TODO: Fill in the output data path. The default value is "Amphion/data"
        "processed_dir": "data",
        // acoustic features
        "extract_mel": true,
        "extract_audio": true,
        "extract_pitch": false,
        "extract_uv": false,
        "extract_amplitude_phase": false,
        "pitch_extractor": "parselmouth",
        // Features used for model training
        "use_mel": true,
        "use_frame_pitch": false,
        "use_uv": false,
        "use_audio": true,
        "n_mel": 100,
        "sample_rate": 24000
    },
    "model": {
        "generator": "hifigan",
        "discriminators": [
            "msd",
            "mpd",
            "mssbcqtd",
            "msstftd",
        ],
        "hifigan": {
            "resblock": "1",
            "upsample_rates": [
                8,
                4,
                2,
                2,
                2
            ],
            "upsample_kernel_sizes": [
                16,
                8,
                4,
                4,
                4
            ],
            "upsample_initial_channel": 768,
            "resblock_kernel_sizes": [
                3,
                5,
                7
            ],
            "resblock_dilation_sizes": [
                [
                    1,
                    3,
                    5
                ],
                [
                    1,
                    3,
                    5
                ],
                [
                    1,
                    3,
                    5
                ]
            ]
        },
        "mpd": {
            "mpd_reshapes": [
                2,
                3,
                5,
                7,
                11,
                17,
                23,
                37
            ],
            "use_spectral_norm": false,
            "discriminator_channel_multi": 1
        }
    },
    "train": {
        "batch_size": 16,
        "adamw": {
            "lr": 2.0e-4,
            "adam_b1": 0.8,
            "adam_b2": 0.99
        },
        "exponential_lr": {
            "lr_decay": 0.999
        },
        "criterions": [
            "feature",
            "discriminator",
            "generator",
            "mel",
        ]
    },
    "inference": {
        "batch_size": 1,
    }
}
egs/vocoder/gan/tfr_enhanced_hifigan/run.sh
ADDED
@@ -0,0 +1,145 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

######## Build Experiment Environment ###########
exp_dir=$(cd `dirname $0`; pwd)
work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))

export WORK_DIR=$work_dir
export PYTHONPATH=$work_dir
export PYTHONIOENCODING=UTF-8

######## Parse the Given Parameters from the Command ###########
options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,resume:,checkpoint:,resume_type:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
eval set -- "$options"

while true; do
  case $1 in
    # Experimental Configuration File
    -c | --config) shift; exp_config=$1 ; shift ;;
    # Experimental Name
    -n | --name) shift; exp_name=$1 ; shift ;;
    # Running Stage
    -s | --stage) shift; running_stage=$1 ; shift ;;
    # Visible GPU machines. The default value is "0".
    --gpu) shift; gpu=$1 ; shift ;;

    # [Only for Training] Resume configuration
    --resume) shift; resume=$1 ; shift ;;
    # [Only for Training] The specific checkpoint path that you want to resume from.
    --checkpoint) shift; checkpoint=$1 ; shift ;;
    # [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
    --resume_type) shift; resume_type=$1 ; shift ;;

    # [Only for Inference] The inference mode
    --infer_mode) shift; infer_mode=$1 ; shift ;;
    # [Only for Inference] The datasets used for inference
    --infer_datasets) shift; infer_datasets=$1 ; shift ;;
    # [Only for Inference] The feature dir for inference
    --infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
    # [Only for Inference] The audio dir for inference
    --infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
    # [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
    --infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
    # [Only for Inference] The output dir to save inferred audios. Its default value is "$expt_dir/result"
    --infer_output_dir) shift; infer_output_dir=$1 ; shift ;;

    --) shift ; break ;;
    *) echo "Invalid option: $1" ; exit 1 ;;
  esac
done


### Value check ###
if [ -z "$running_stage" ]; then
    echo "[Error] Please specify the running stage"
    exit 1
fi

if [ -z "$exp_config" ]; then
    exp_config="${exp_dir}"/exp_config.json
fi
echo "Experimental Configuration File: $exp_config"

if [ -z "$gpu" ]; then
    gpu="0"
fi

######## Features Extraction ###########
if [ $running_stage -eq 1 ]; then
    CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
        --config $exp_config \
        --num_workers 8
fi

######## Training ###########
if [ $running_stage -eq 2 ]; then
    if [ -z "$exp_name" ]; then
        echo "[Error] Please specify the experiment name"
        exit 1
    fi
    echo "Experimental Name: $exp_name"

    if [ "$resume" = true ]; then
        echo "Automatically resume from the experimental dir..."
        CUDA_VISIBLE_DEVICES="$gpu" accelerate launch "${work_dir}"/bins/vocoder/train.py \
            --config "$exp_config" \
            --exp_name "$exp_name" \
            --log_level info \
            --resume
    else
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/vocoder/train.py \
            --config "$exp_config" \
            --exp_name "$exp_name" \
            --log_level info \
            --checkpoint "$checkpoint" \
            --resume_type "$resume_type"
    fi
fi

######## Inference/Conversion ###########
if [ $running_stage -eq 3 ]; then
    if [ -z "$infer_expt_dir" ]; then
        echo "[Error] Please specify the experimental directory. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
        exit 1
    fi

    if [ -z "$infer_output_dir" ]; then
        infer_output_dir="$infer_expt_dir/result"
    fi

    echo $infer_datasets

    if [ $infer_mode = "infer_from_dataset" ]; then
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
            --config $exp_config \
            --infer_mode $infer_mode \
            --infer_datasets $infer_datasets \
            --vocoder_dir $infer_expt_dir \
            --output_dir $infer_output_dir \
            --log_level debug
    fi

    if [ $infer_mode = "infer_from_feature" ]; then
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
            --config $exp_config \
            --infer_mode $infer_mode \
            --feature_folder $infer_feature_dir \
            --vocoder_dir $infer_expt_dir \
            --output_dir $infer_output_dir \
            --log_level debug
    fi

    if [ $infer_mode = "infer_from_audio" ]; then
        CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
            --config $exp_config \
            --infer_mode $infer_mode \
            --audio_folder $infer_audio_dir \
            --vocoder_dir $infer_expt_dir \
            --output_dir $infer_output_dir \
            --log_level debug
    fi

fi
examples/chinese_female_recordings.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f710270fe3857211c55aaa1f813e310e68855ff9eabaf5b249537a2d4277cc30
size 448928
examples/chinese_male_seperated.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:009077a677b23bff3154078930e6c624d218eb0acbe78990bec88f6bf5a6e5de
size 480044
examples/english_female_seperated.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:87e75863ffb4e597467a825d019217e73d64dce1e9635de60a32559ffcb97cf4
size 1509584
examples/english_male_recordings.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e14ebf1c554ebb25e5169b4bcda36a685538e94c531f303339bad91ff93a2288
size 251948
examples/output/.DS_Store
ADDED
Binary file (6.15 kB)
examples/output/chinese_female_recordings_vocalist_l1_JohnMayer.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bf6d6ef89ba2234fbc64c0ee48f81528cf49717a23a919aa8d0767ada2437113
size 244268
examples/output/chinese_male_seperated_vocalist_l1_TaylorSwift.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0e682abb072246f412133bfa313c6edf863f1d6a6db63022749f74c2c7ef01c7
size 479788
examples/output/english_female_seperated_vocalist_l1_汪峰.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a03755cfc9aef4d26bda6370d9335625482f22f2c1f3c918dbbec3246213cee2
size 410668
examples/output/english_male_recordings_vocalist_l1_石倚洁.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e850a0e02f2741185c3d3b642a9c292a3a297cdf262e92333b63adf98af7d450
size 251948
models/__init__.py
ADDED
File without changes
models/base/__init__.py
ADDED
@@ -0,0 +1,7 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from .new_trainer import BaseTrainer
from .new_inference import BaseInference
models/base/base_dataset.py
ADDED
|
@@ -0,0 +1,350 @@
| 1 |
+
# Copyright (c) 2023 Amphion.
|
| 2 |
+
#
|
| 3 |
+
# This source code is licensed under the MIT license found in the
|
| 4 |
+
# LICENSE file in the root directory of this source tree.
|
| 5 |
+
|
| 6 |
+
import torch
|
| 7 |
+
import numpy as np
|
| 8 |
+
import torch.utils.data
|
| 9 |
+
from torch.nn.utils.rnn import pad_sequence
|
| 10 |
+
from utils.data_utils import *
|
| 11 |
+
from processors.acoustic_extractor import cal_normalized_mel
|
| 12 |
+
from text import text_to_sequence
|
| 13 |
+
from text.text_token_collation import phoneIDCollation
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
class BaseDataset(torch.utils.data.Dataset):
|
| 17 |
+
def __init__(self, cfg, dataset, is_valid=False):
|
| 18 |
+
"""
|
| 19 |
+
Args:
|
| 20 |
+
cfg: config
|
| 21 |
+
dataset: dataset name
|
| 22 |
+
is_valid: whether to use train or valid dataset
|
| 23 |
+
"""
|
| 24 |
+
|
| 25 |
+
assert isinstance(dataset, str)
|
| 26 |
+
|
| 27 |
+
# self.data_root = processed_data_dir
|
| 28 |
+
self.cfg = cfg
|
| 29 |
+
|
| 30 |
+
processed_data_dir = os.path.join(cfg.preprocess.processed_dir, dataset)
|
| 31 |
+
meta_file = cfg.preprocess.valid_file if is_valid else cfg.preprocess.train_file
|
| 32 |
+
self.metafile_path = os.path.join(processed_data_dir, meta_file)
|
| 33 |
+
self.metadata = self.get_metadata()
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
'''
|
| 38 |
+
load spk2id and utt2spk from json file
|
| 39 |
+
spk2id: {spk1: 0, spk2: 1, ...}
|
| 40 |
+
utt2spk: {dataset_uid: spk1, ...}
|
| 41 |
+
'''
|
| 42 |
+
if cfg.preprocess.use_spkid:
|
| 43 |
+
spk2id_path = os.path.join(processed_data_dir, cfg.preprocess.spk2id)
|
| 44 |
+
with open(spk2id_path, "r") as f:
|
| 45 |
+
self.spk2id = json.load(f)
|
| 46 |
+
|
| 47 |
+
utt2spk_path = os.path.join(processed_data_dir, cfg.preprocess.utt2spk)
|
| 48 |
+
self.utt2spk = dict()
|
| 49 |
+
with open(utt2spk_path, "r") as f:
|
| 50 |
+
for line in f.readlines():
|
| 51 |
+
utt, spk = line.strip().split('\t')
|
| 52 |
+
self.utt2spk[utt] = spk
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
if cfg.preprocess.use_uv:
|
| 56 |
+
self.utt2uv_path = {}
|
| 57 |
+
for utt_info in self.metadata:
|
| 58 |
+
dataset = utt_info["Dataset"]
|
| 59 |
+
uid = utt_info["Uid"]
|
| 60 |
+
utt = "{}_{}".format(dataset, uid)
|
| 61 |
+
self.utt2uv_path[utt] = os.path.join(
|
| 62 |
+
cfg.preprocess.processed_dir,
|
| 63 |
+
dataset,
|
| 64 |
+
cfg.preprocess.uv_dir,
|
| 65 |
+
uid + ".npy",
|
| 66 |
+
)
|
| 67 |
+
|
| 68 |
+
if cfg.preprocess.use_frame_pitch:
|
| 69 |
+
self.utt2frame_pitch_path = {}
|
| 70 |
+
for utt_info in self.metadata:
|
| 71 |
+
dataset = utt_info["Dataset"]
|
| 72 |
+
uid = utt_info["Uid"]
|
| 73 |
+
utt = "{}_{}".format(dataset, uid)
|
| 74 |
+
|
| 75 |
+
self.utt2frame_pitch_path[utt] = os.path.join(
|
| 76 |
+
cfg.preprocess.processed_dir,
|
| 77 |
+
dataset,
|
| 78 |
+
cfg.preprocess.pitch_dir,
|
| 79 |
+
uid + ".npy",
|
| 80 |
+
)
|
| 81 |
+
|
| 82 |
+
if cfg.preprocess.use_frame_energy:
|
| 83 |
+
self.utt2frame_energy_path = {}
|
| 84 |
+
for utt_info in self.metadata:
|
| 85 |
+
dataset = utt_info["Dataset"]
|
| 86 |
+
uid = utt_info["Uid"]
|
| 87 |
+
utt = "{}_{}".format(dataset, uid)
|
| 88 |
+
|
| 89 |
+
self.utt2frame_energy_path[utt] = os.path.join(
|
| 90 |
+
cfg.preprocess.processed_dir,
|
| 91 |
+
dataset,
|
| 92 |
+
cfg.preprocess.energy_dir,
|
| 93 |
+
uid + ".npy",
|
| 94 |
+
)
|
| 95 |
+
|
| 96 |
+
if cfg.preprocess.use_mel:
|
| 97 |
+
self.utt2mel_path = {}
|
| 98 |
+
for utt_info in self.metadata:
|
| 99 |
+
dataset = utt_info["Dataset"]
|
| 100 |
+
uid = utt_info["Uid"]
|
| 101 |
+
utt = "{}_{}".format(dataset, uid)
|
| 102 |
+
|
| 103 |
+
self.utt2mel_path[utt] = os.path.join(
|
| 104 |
+
cfg.preprocess.processed_dir,
|
| 105 |
+
dataset,
|
| 106 |
+
cfg.preprocess.mel_dir,
|
| 107 |
+
uid + ".npy",
|
| 108 |
+
)
|
| 109 |
+
|
| 110 |
+
if cfg.preprocess.use_linear:
|
| 111 |
+
self.utt2linear_path = {}
|
| 112 |
+
for utt_info in self.metadata:
|
| 113 |
+
dataset = utt_info["Dataset"]
|
| 114 |
+
uid = utt_info["Uid"]
|
| 115 |
+
utt = "{}_{}".format(dataset, uid)
|
| 116 |
+
|
| 117 |
+
self.utt2linear_path[utt] = os.path.join(
|
| 118 |
+
cfg.preprocess.processed_dir,
|
| 119 |
+
dataset,
|
| 120 |
+
cfg.preprocess.linear_dir,
|
| 121 |
+
uid + ".npy",
|
| 122 |
+
)
|
| 123 |
+
|
| 124 |
+
if cfg.preprocess.use_audio:
|
| 125 |
+
self.utt2audio_path = {}
|
| 126 |
+
for utt_info in self.metadata:
|
| 127 |
+
dataset = utt_info["Dataset"]
|
| 128 |
+
uid = utt_info["Uid"]
|
| 129 |
+
utt = "{}_{}".format(dataset, uid)
|
| 130 |
+
|
| 131 |
+
self.utt2audio_path[utt] = os.path.join(
|
| 132 |
+
cfg.preprocess.processed_dir,
|
| 133 |
+
dataset,
|
| 134 |
+
cfg.preprocess.audio_dir,
|
| 135 |
+
uid + ".npy",
|
| 136 |
+
)
|
| 137 |
+
elif cfg.preprocess.use_label:
|
| 138 |
+
self.utt2label_path = {}
|
| 139 |
+
for utt_info in self.metadata:
|
| 140 |
+
dataset = utt_info["Dataset"]
|
| 141 |
+
uid = utt_info["Uid"]
|
| 142 |
+
utt = "{}_{}".format(dataset, uid)
|
| 143 |
+
|
| 144 |
+
self.utt2label_path[utt] = os.path.join(
|
| 145 |
+
cfg.preprocess.processed_dir,
|
| 146 |
+
dataset,
|
| 147 |
+
cfg.preprocess.label_dir,
|
| 148 |
+
uid + ".npy",
|
| 149 |
+
)
|
| 150 |
+
elif cfg.preprocess.use_one_hot:
|
| 151 |
+
self.utt2one_hot_path = {}
|
| 152 |
+
for utt_info in self.metadata:
|
| 153 |
+
dataset = utt_info["Dataset"]
|
| 154 |
+
uid = utt_info["Uid"]
|
| 155 |
+
utt = "{}_{}".format(dataset, uid)
|
| 156 |
+
|
| 157 |
+
self.utt2one_hot_path[utt] = os.path.join(
|
| 158 |
+
cfg.preprocess.processed_dir,
|
| 159 |
+
dataset,
|
| 160 |
+
cfg.preprocess.one_hot_dir,
|
| 161 |
+
uid + ".npy",
|
| 162 |
+
)
|
| 163 |
+
|
| 164 |
+
if cfg.preprocess.use_text or cfg.preprocess.use_phone:
|
| 165 |
+
self.utt2seq = {}
|
| 166 |
+
for utt_info in self.metadata:
|
| 167 |
+
dataset = utt_info["Dataset"]
|
| 168 |
+
uid = utt_info["Uid"]
|
| 169 |
+
utt = "{}_{}".format(dataset, uid)
|
| 170 |
+
|
| 171 |
+
if cfg.preprocess.use_text:
|
| 172 |
+
text = utt_info["Text"]
|
| 173 |
+
sequence = text_to_sequence(text, cfg.preprocess.text_cleaners)
|
| 174 |
+
elif cfg.preprocess.use_phone:
|
| 175 |
+
# load the phoneme sequence from the phone file
|
| 176 |
+
phone_path = os.path.join(processed_data_dir,
|
| 177 |
+
cfg.preprocess.phone_dir,
|
| 178 |
+
uid+'.phone'
|
| 179 |
+
)
|
| 180 |
+
with open(phone_path, 'r') as fin:
|
| 181 |
+
phones = fin.readlines()
|
| 182 |
+
assert len(phones) == 1
|
| 183 |
+
phones = phones[0].strip()
|
| 184 |
+
phones_seq = phones.split(' ')
|
| 185 |
+
|
| 186 |
+
phon_id_collator = phoneIDCollation(cfg, dataset=dataset)
|
| 187 |
+
sequence = phon_id_collator.get_phone_id_sequence(cfg, phones_seq)
|
| 188 |
+
|
| 189 |
+
self.utt2seq[utt] = sequence
|
| 190 |
+
|
| 191 |
+
|
| 192 |
+
def get_metadata(self):
|
| 193 |
+
with open(self.metafile_path, "r", encoding="utf-8") as f:
|
| 194 |
+
metadata = json.load(f)
|
| 195 |
+
|
| 196 |
+
return metadata
|
| 197 |
+
|
| 198 |
+
def get_dataset_name(self):
|
| 199 |
+
return self.metadata[0]["Dataset"]
|
| 200 |
+
|
| 201 |
+
def __getitem__(self, index):
|
| 202 |
+
utt_info = self.metadata[index]
|
| 203 |
+
|
| 204 |
+
dataset = utt_info["Dataset"]
|
| 205 |
+
uid = utt_info["Uid"]
|
| 206 |
+
utt = "{}_{}".format(dataset, uid)
|
| 207 |
+
|
| 208 |
+
single_feature = dict()
|
| 209 |
+
|
| 210 |
+
if self.cfg.preprocess.use_spkid:
|
| 211 |
+
single_feature["spk_id"] = np.array(
|
| 212 |
+
[self.spk2id[self.utt2spk[utt]]], dtype=np.int32
|
| 213 |
+
)
|
| 214 |
+
|
| 215 |
+
if self.cfg.preprocess.use_mel:
|
| 216 |
+
mel = np.load(self.utt2mel_path[utt])
|
| 217 |
+
assert mel.shape[0] == self.cfg.preprocess.n_mel # [n_mels, T]
|
| 218 |
+
if self.cfg.preprocess.use_min_max_norm_mel:
|
| 219 |
+
# do mel norm
|
| 220 |
+
mel = cal_normalized_mel(mel, utt_info["Dataset"], self.cfg.preprocess)
|
| 221 |
+
|
| 222 |
+
if "target_len" not in single_feature.keys():
|
| 223 |
+
single_feature["target_len"] = mel.shape[1]
|
| 224 |
+
single_feature["mel"] = mel.T # [T, n_mels]
|
| 225 |
+
|
| 226 |
+
if self.cfg.preprocess.use_linear:
|
| 227 |
+
linear = np.load(self.utt2linear_path[utt])
|
| 228 |
+
if "target_len" not in single_feature.keys():
|
| 229 |
+
single_feature["target_len"] = linear.shape[1]
|
| 230 |
+
single_feature["linear"] = linear.T # [T, n_linear]
|
| 231 |
+
|
| 232 |
+
if self.cfg.preprocess.use_frame_pitch:
|
| 233 |
+
frame_pitch_path = self.utt2frame_pitch_path[utt]
|
| 234 |
+
frame_pitch = np.load(frame_pitch_path)
|
| 235 |
+
if "target_len" not in single_feature.keys():
|
| 236 |
+
single_feature["target_len"] = len(frame_pitch)
|
| 237 |
+
aligned_frame_pitch = align_length(
|
| 238 |
+
frame_pitch, single_feature["target_len"]
|
| 239 |
+
)
|
| 240 |
+
single_feature["frame_pitch"] = aligned_frame_pitch
|
| 241 |
+
|
| 242 |
+
if self.cfg.preprocess.use_uv:
|
| 243 |
+
frame_uv_path = self.utt2uv_path[utt]
|
| 244 |
+
frame_uv = np.load(frame_uv_path)
|
| 245 |
+
aligned_frame_uv = align_length(frame_uv, single_feature["target_len"])
|
| 246 |
+
aligned_frame_uv = [
|
| 247 |
+
0 if frame_uv else 1 for frame_uv in aligned_frame_uv
|
| 248 |
+
]
|
| 249 |
+
aligned_frame_uv = np.array(aligned_frame_uv)
|
| 250 |
+
single_feature["frame_uv"] = aligned_frame_uv
|
| 251 |
+
|
| 252 |
+
if self.cfg.preprocess.use_frame_energy:
|
| 253 |
+
frame_energy_path = self.utt2frame_energy_path[utt]
|
| 254 |
+
frame_energy = np.load(frame_energy_path)
|
| 255 |
+
if "target_len" not in single_feature.keys():
|
| 256 |
+
single_feature["target_len"] = len(frame_energy)
|
| 257 |
+
aligned_frame_energy = align_length(
|
| 258 |
+
frame_energy, single_feature["target_len"]
|
| 259 |
+
)
|
| 260 |
+
single_feature["frame_energy"] = aligned_frame_energy
|
| 261 |
+
|
| 262 |
+
if self.cfg.preprocess.use_audio:
|
| 263 |
+
audio = np.load(self.utt2audio_path[utt])
|
| 264 |
+
single_feature["audio"] = audio
|
| 265 |
+
single_feature["audio_len"] = audio.shape[0]
|
| 266 |
+
|
| 267 |
+
if self.cfg.preprocess.use_phone or self.cfg.preprocess.use_text:
|
| 268 |
+
single_feature["phone_seq"] = np.array(self.utt2seq[utt])
|
| 269 |
+
single_feature["phone_len"] = len(self.utt2seq[utt])
|
| 270 |
+
|
| 271 |
+
return single_feature
|
| 272 |
+
|
| 273 |
+
def __len__(self):
|
| 274 |
+
return len(self.metadata)
|
| 275 |
+
|
| 276 |
+
|
| 277 |
+
class BaseCollator(object):
|
| 278 |
+
"""Zero-pads model inputs and targets based on number of frames per step"""
|
| 279 |
+
|
| 280 |
+
def __init__(self, cfg):
|
| 281 |
+
self.cfg = cfg
|
| 282 |
+
|
| 283 |
+
def __call__(self, batch):
|
| 284 |
+
packed_batch_features = dict()
|
| 285 |
+
|
| 286 |
+
# mel: [b, T, n_mels]
|
| 287 |
+
# frame_pitch, frame_energy: [1, T]
|
| 288 |
+
# target_len: [1]
|
| 289 |
+
# spk_id: [b, 1]
|
| 290 |
+
# mask: [b, T, 1]
|
| 291 |
+
|
| 292 |
+
for key in batch[0].keys():
|
| 293 |
+
if key == "target_len":
|
| 294 |
+
packed_batch_features["target_len"] = torch.LongTensor(
|
| 295 |
+
[b["target_len"] for b in batch]
|
| 296 |
+
)
|
| 297 |
+
masks = [
|
| 298 |
+
torch.ones((b["target_len"], 1), dtype=torch.long) for b in batch
|
| 299 |
+
]
|
| 300 |
+
packed_batch_features["mask"] = pad_sequence(
|
| 301 |
+
masks, batch_first=True, padding_value=0
|
| 302 |
+
)
|
| 303 |
+
elif key == "phone_len":
|
| 304 |
+
packed_batch_features["phone_len"] = torch.LongTensor(
|
| 305 |
+
[b["phone_len"] for b in batch]
|
| 306 |
+
)
|
| 307 |
+
masks = [
|
| 308 |
+
torch.ones((b["phone_len"], 1), dtype=torch.long) for b in batch
|
| 309 |
+
]
|
| 310 |
+
packed_batch_features["phn_mask"] = pad_sequence(
|
| 311 |
+
masks, batch_first=True, padding_value=0
|
| 312 |
+
)
|
| 313 |
+
elif key == "audio_len":
|
| 314 |
+
packed_batch_features["audio_len"] = torch.LongTensor(
|
| 315 |
+
[b["audio_len"] for b in batch]
|
| 316 |
+
)
|
| 317 |
+
masks = [
|
| 318 |
+
torch.ones((b["audio_len"], 1), dtype=torch.long) for b in batch
|
| 319 |
+
]
|
| 320 |
+
else:
|
| 321 |
+
values = [torch.from_numpy(b[key]) for b in batch]
|
| 322 |
+
packed_batch_features[key] = pad_sequence(
|
| 323 |
+
values, batch_first=True, padding_value=0
|
| 324 |
+
)
|
| 325 |
+
return packed_batch_features
|
| 326 |
+
|
| 327 |
+
|
| 328 |
+
class BaseTestDataset(torch.utils.data.Dataset):
|
| 329 |
+
def __init__(self, cfg, args):
|
| 330 |
+
raise NotImplementedError
|
| 331 |
+
|
| 332 |
+
|
| 333 |
+
def get_metadata(self):
|
| 334 |
+
raise NotImplementedError
|
| 335 |
+
|
| 336 |
+
def __getitem__(self, index):
|
| 337 |
+
raise NotImplementedError
|
| 338 |
+
|
| 339 |
+
def __len__(self):
|
| 340 |
+
return len(self.metadata)
|
| 341 |
+
|
| 342 |
+
|
| 343 |
+
class BaseTestCollator(object):
|
| 344 |
+
"""Zero-pads model inputs and targets based on number of frames per step"""
|
| 345 |
+
|
| 346 |
+
def __init__(self, cfg):
|
| 347 |
+
raise NotImplementedError
|
| 348 |
+
|
| 349 |
+
def __call__(self, batch):
|
| 350 |
+
raise NotImplementedError
|
models/base/base_inference.py
ADDED
|
@@ -0,0 +1,220 @@
| 1 |
+
# Copyright (c) 2023 Amphion.
|
| 2 |
+
#
|
| 3 |
+
# This source code is licensed under the MIT license found in the
|
| 4 |
+
# LICENSE file in the root directory of this source tree.
|
| 5 |
+
|
| 6 |
+
import argparse
|
| 7 |
+
import os
|
| 8 |
+
import re
|
| 9 |
+
import time
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
|
| 12 |
+
import torch
|
| 13 |
+
from torch.utils.data import DataLoader
|
| 14 |
+
from tqdm import tqdm
|
| 15 |
+
|
| 16 |
+
from models.vocoders.vocoder_inference import synthesis
|
| 17 |
+
from torch.utils.data import DataLoader
|
| 18 |
+
from utils.util import set_all_random_seed
|
| 19 |
+
from utils.util import load_config
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
def parse_vocoder(vocoder_dir):
|
| 23 |
+
r"""Parse vocoder config"""
|
| 24 |
+
vocoder_dir = os.path.abspath(vocoder_dir)
|
| 25 |
+
ckpt_list = [ckpt for ckpt in Path(vocoder_dir).glob("*.pt")]
|
| 26 |
+
ckpt_list.sort(key=lambda x: int(x.stem), reverse=True)
|
| 27 |
+
ckpt_path = str(ckpt_list[0])
|
| 28 |
+
vocoder_cfg = load_config(os.path.join(vocoder_dir, "args.json"), lowercase=True)
|
| 29 |
+
vocoder_cfg.model.bigvgan = vocoder_cfg.vocoder
|
| 30 |
+
return vocoder_cfg, ckpt_path
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
class BaseInference(object):
|
| 34 |
+
def __init__(self, cfg, args):
|
| 35 |
+
self.cfg = cfg
|
| 36 |
+
self.args = args
|
| 37 |
+
self.model_type = cfg.model_type
|
| 38 |
+
self.avg_rtf = list()
|
| 39 |
+
set_all_random_seed(10086)
|
| 40 |
+
os.makedirs(args.output_dir, exist_ok=True)
|
| 41 |
+
|
| 42 |
+
if torch.cuda.is_available():
|
| 43 |
+
self.device = torch.device("cuda")
|
| 44 |
+
else:
|
| 45 |
+
self.device = torch.device("cpu")
|
| 46 |
+
torch.set_num_threads(10) # inference on 1 core cpu.
|
| 47 |
+
|
| 48 |
+
# Load acoustic model
|
| 49 |
+
self.model = self.create_model().to(self.device)
|
| 50 |
+
state_dict = self.load_state_dict()
|
| 51 |
+
self.load_model(state_dict)
|
| 52 |
+
self.model.eval()
|
| 53 |
+
|
| 54 |
+
# Load vocoder model if necessary
|
| 55 |
+
if self.args.checkpoint_dir_vocoder is not None:
|
| 56 |
+
self.get_vocoder_info()
|
| 57 |
+
|
| 58 |
+
def create_model(self):
|
| 59 |
+
raise NotImplementedError
|
| 60 |
+
|
| 61 |
+
def load_state_dict(self):
|
| 62 |
+
self.checkpoint_file = self.args.checkpoint_file
|
| 63 |
+
if self.checkpoint_file is None:
|
| 64 |
+
assert self.args.checkpoint_dir is not None
|
| 65 |
+
checkpoint_path = os.path.join(self.args.checkpoint_dir, "checkpoint")
|
| 66 |
+
checkpoint_filename = open(checkpoint_path).readlines()[-1].strip()
|
| 67 |
+
self.checkpoint_file = os.path.join(
|
| 68 |
+
self.args.checkpoint_dir, checkpoint_filename
|
| 69 |
+
)
|
| 70 |
+
|
| 71 |
+
self.checkpoint_dir = os.path.split(self.checkpoint_file)[0]
|
| 72 |
+
|
| 73 |
+
print("Restore acoustic model from {}".format(self.checkpoint_file))
|
| 74 |
+
raw_state_dict = torch.load(self.checkpoint_file, map_location=self.device)
|
| 75 |
+
self.am_restore_step = re.findall(r"step-(.+?)_loss", self.checkpoint_file)[0]
|
| 76 |
+
|
| 77 |
+
return raw_state_dict
|
| 78 |
+
|
| 79 |
+
def load_model(self, model):
|
| 80 |
+
raise NotImplementedError
|
| 81 |
+
|
| 82 |
+
def get_vocoder_info(self):
|
| 83 |
+
self.checkpoint_dir_vocoder = self.args.checkpoint_dir_vocoder
|
| 84 |
+
self.vocoder_cfg = os.path.join(
|
| 85 |
+
os.path.dirname(self.checkpoint_dir_vocoder), "args.json"
|
| 86 |
+
)
|
| 87 |
+
self.cfg.vocoder = load_config(self.vocoder_cfg, lowercase=True)
|
| 88 |
+
self.vocoder_tag = self.checkpoint_dir_vocoder.split("/")[-2].split(":")[-1]
|
| 89 |
+
self.vocoder_steps = self.checkpoint_dir_vocoder.split("/")[-1].split(".")[0]
|
| 90 |
+
|
| 91 |
+
def build_test_utt_data(self):
|
| 92 |
+
raise NotImplementedError
|
| 93 |
+
|
| 94 |
+
def build_testdata_loader(self, args, target_speaker=None):
|
| 95 |
+
datasets, collate = self.build_test_dataset()
|
| 96 |
+
self.test_dataset = datasets(self.cfg, args, target_speaker)
|
| 97 |
+
self.test_collate = collate(self.cfg)
|
| 98 |
+
self.test_batch_size = min(
|
| 99 |
+
self.cfg.train.batch_size, len(self.test_dataset.metadata)
|
| 100 |
+
)
|
| 101 |
+
test_loader = DataLoader(
|
| 102 |
+
self.test_dataset,
|
| 103 |
+
collate_fn=self.test_collate,
|
| 104 |
+
num_workers=self.args.num_workers,
|
| 105 |
+
batch_size=self.test_batch_size,
|
| 106 |
+
shuffle=False,
|
| 107 |
+
)
|
| 108 |
+
return test_loader
|
| 109 |
+
|
| 110 |
+
def inference_each_batch(self, batch_data):
|
| 111 |
+
raise NotImplementedError
|
| 112 |
+
|
| 113 |
+
def inference_for_batches(self, args, target_speaker=None):
|
| 114 |
+
###### Construct test_batch ######
|
| 115 |
+
loader = self.build_testdata_loader(args, target_speaker)
|
| 116 |
+
|
| 117 |
+
n_batch = len(loader)
|
| 118 |
+
now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
|
| 119 |
+
print(
|
| 120 |
+
"Model eval time: {}, batch_size = {}, n_batch = {}".format(
|
| 121 |
+
now, self.test_batch_size, n_batch
|
| 122 |
+
)
|
| 123 |
+
)
|
| 124 |
+
self.model.eval()
|
| 125 |
+
|
| 126 |
+
###### Inference for each batch ######
|
| 127 |
+
pred_res = []
|
| 128 |
+
with torch.no_grad():
|
| 129 |
+
for i, batch_data in enumerate(loader if n_batch == 1 else tqdm(loader)):
|
| 130 |
+
# Put the data to device
|
| 131 |
+
for k, v in batch_data.items():
|
| 132 |
+
batch_data[k] = batch_data[k].to(self.device)
|
| 133 |
+
|
| 134 |
+
y_pred, stats = self.inference_each_batch(batch_data)
|
| 135 |
+
|
| 136 |
+
pred_res += y_pred
|
| 137 |
+
|
| 138 |
+
return pred_res
|
| 139 |
+
|
| 140 |
+
def inference(self, feature):
|
| 141 |
+
raise NotImplementedError
|
| 142 |
+
|
| 143 |
+
def synthesis_by_vocoder(self, pred):
|
| 144 |
+
audios_pred = synthesis(
|
| 145 |
+
self.vocoder_cfg,
|
| 146 |
+
self.checkpoint_dir_vocoder,
|
| 147 |
+
len(pred),
|
| 148 |
+
pred,
|
| 149 |
+
)
|
| 150 |
+
return audios_pred
|
| 151 |
+
|
| 152 |
+
def __call__(self, utt):
|
| 153 |
+
feature = self.build_test_utt_data(utt)
|
| 154 |
+
start_time = time.time()
|
| 155 |
+
with torch.no_grad():
|
| 156 |
+
outputs = self.inference(feature)[0]
|
| 157 |
+
time_used = time.time() - start_time
|
| 158 |
+
rtf = time_used / (
|
| 159 |
+
outputs.shape[1]
|
| 160 |
+
* self.cfg.preprocess.hop_size
|
| 161 |
+
/ self.cfg.preprocess.sample_rate
|
| 162 |
+
)
|
| 163 |
+
print("Time used: {:.3f}, RTF: {:.4f}".format(time_used, rtf))
|
| 164 |
+
self.avg_rtf.append(rtf)
|
| 165 |
+
audios = outputs.cpu().squeeze().numpy().reshape(-1, 1)
|
| 166 |
+
return audios
|
| 167 |
+
|
| 168 |
+
|
| 169 |
+
def base_parser():
|
| 170 |
+
parser = argparse.ArgumentParser()
|
| 171 |
+
parser.add_argument(
|
| 172 |
+
"--config", default="config.json", help="json files for configurations."
|
| 173 |
+
)
|
| 174 |
+
parser.add_argument("--use_ddp_inference", default=False)
|
| 175 |
+
parser.add_argument("--n_workers", default=1, type=int)
|
| 176 |
+
parser.add_argument("--local_rank", default=-1, type=int)
|
| 177 |
+
parser.add_argument(
|
| 178 |
+
"--batch_size", default=1, type=int, help="Batch size for inference"
|
| 179 |
+
)
|
| 180 |
+
parser.add_argument(
|
| 181 |
+
"--num_workers",
|
| 182 |
+
default=1,
|
| 183 |
+
type=int,
|
| 184 |
+
help="Worker number for inference dataloader",
|
| 185 |
+
)
|
| 186 |
+
parser.add_argument(
|
| 187 |
+
"--checkpoint_dir",
|
| 188 |
+
type=str,
|
| 189 |
+
default=None,
|
| 190 |
+
help="Checkpoint dir including model file and configuration",
|
| 191 |
+
)
|
| 192 |
+
parser.add_argument(
|
| 193 |
+
"--checkpoint_file", help="checkpoint file", type=str, default=None
|
| 194 |
+
)
|
| 195 |
+
parser.add_argument(
|
| 196 |
+
"--test_list", help="test utterance list for testing", type=str, default=None
|
| 197 |
+
)
|
| 198 |
+
parser.add_argument(
|
| 199 |
+
"--checkpoint_dir_vocoder",
|
| 200 |
+
help="Vocoder's checkpoint dir including model file and configuration",
|
| 201 |
+
type=str,
|
| 202 |
+
default=None,
|
| 203 |
+
)
|
| 204 |
+
parser.add_argument(
|
| 205 |
+
"--output_dir",
|
| 206 |
+
type=str,
|
| 207 |
+
default=None,
|
| 208 |
+
help="Output dir for saving generated results",
|
| 209 |
+
)
|
| 210 |
+
return parser
|
| 211 |
+
|
| 212 |
+
|
| 213 |
+
if __name__ == "__main__":
|
| 214 |
+
parser = base_parser()
|
| 215 |
+
args = parser.parse_args()
|
| 216 |
+
cfg = load_config(args.config)
|
| 217 |
+
|
| 218 |
+
# Build inference
|
| 219 |
+
inference = BaseInference(cfg, args)
|
| 220 |
+
inference()
|
models/base/base_sampler.py
ADDED
|
@@ -0,0 +1,136 @@
| 1 |
+
# Copyright (c) 2023 Amphion.
|
| 2 |
+
#
|
| 3 |
+
# This source code is licensed under the MIT license found in the
|
| 4 |
+
# LICENSE file in the root directory of this source tree.
|
| 5 |
+
|
| 6 |
+
import math
|
| 7 |
+
import random
|
| 8 |
+
|
| 9 |
+
from torch.utils.data import ConcatDataset, Dataset
|
| 10 |
+
from torch.utils.data.sampler import (
|
| 11 |
+
BatchSampler,
|
| 12 |
+
RandomSampler,
|
| 13 |
+
Sampler,
|
| 14 |
+
SequentialSampler,
|
| 15 |
+
)
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
class ScheduledSampler(Sampler):
|
| 19 |
+
"""A sampler that samples data from a given concat-dataset.
|
| 20 |
+
|
| 21 |
+
Args:
|
| 22 |
+
concat_dataset (ConcatDataset): a concatenated dataset consisting of all datasets
|
| 23 |
+
batch_size (int): batch size
|
| 24 |
+
holistic_shuffle (bool): whether to shuffle the whole dataset or not
|
| 25 |
+
logger (logging.Logger): logger to print warning message
|
| 26 |
+
|
| 27 |
+
Usage:
|
| 28 |
+
For cfg.train.batch_size = 3, cfg.train.holistic_shuffle = False, cfg.train.drop_last = True:
|
| 29 |
+
>>> list(ScheduledSampler(ConcatDataset([[0, 1, 2], [3, 4, 5], [6, 7, 8]])))
|
| 30 |
+
[3, 4, 5, 0, 1, 2, 6, 7, 8]
|
| 31 |
+
"""
|
| 32 |
+
|
| 33 |
+
def __init__(
|
| 34 |
+
self,
|
| 35 |
+
concat_dataset,
|
| 36 |
+
batch_size,
|
| 37 |
+
holistic_shuffle,
|
| 38 |
+
logger=None,
|
| 39 |
+
loader_type="train",
|
| 40 |
+
):
|
| 41 |
+
if not isinstance(concat_dataset, ConcatDataset):
|
| 42 |
+
raise ValueError(
|
| 43 |
+
"concat_dataset must be an instance of ConcatDataset, but got {}".format(
|
| 44 |
+
type(concat_dataset)
|
| 45 |
+
)
|
| 46 |
+
)
|
| 47 |
+
if not isinstance(batch_size, int):
|
| 48 |
+
raise ValueError(
|
| 49 |
+
"batch_size must be an integer, but got {}".format(type(batch_size))
|
| 50 |
+
)
|
| 51 |
+
if not isinstance(holistic_shuffle, bool):
|
| 52 |
+
raise ValueError(
|
| 53 |
+
"holistic_shuffle must be a boolean, but got {}".format(
|
| 54 |
+
type(holistic_shuffle)
|
| 55 |
+
)
|
| 56 |
+
)
|
| 57 |
+
|
| 58 |
+
self.concat_dataset = concat_dataset
|
| 59 |
+
self.batch_size = batch_size
|
| 60 |
+
self.holistic_shuffle = holistic_shuffle
|
| 61 |
+
|
| 62 |
+
affected_dataset_name = []
|
| 63 |
+
affected_dataset_len = []
|
| 64 |
+
for dataset in concat_dataset.datasets:
|
| 65 |
+
dataset_len = len(dataset)
|
| 66 |
+
dataset_name = dataset.get_dataset_name()
|
| 67 |
+
if dataset_len < batch_size:
|
| 68 |
+
affected_dataset_name.append(dataset_name)
|
| 69 |
+
affected_dataset_len.append(dataset_len)
|
| 70 |
+
|
| 71 |
+
self.type = loader_type
|
| 72 |
+
for dataset_name, dataset_len in zip(
|
| 73 |
+
affected_dataset_name, affected_dataset_len
|
| 74 |
+
):
|
| 75 |
+
if not loader_type == "valid":
|
| 76 |
+
logger.warning(
|
| 77 |
+
"The {} dataset {} has a length of {}, which is smaller than the batch size {}. This may cause unexpected behavior.".format(
|
| 78 |
+
loader_type, dataset_name, dataset_len, batch_size
|
| 79 |
+
)
|
| 80 |
+
)
|
| 81 |
+
|
| 82 |
+
def __len__(self):
|
| 83 |
+
# the number of batches with drop last
|
| 84 |
+
num_of_batches = sum(
|
| 85 |
+
[
|
| 86 |
+
math.floor(len(dataset) / self.batch_size)
|
| 87 |
+
for dataset in self.concat_dataset.datasets
|
| 88 |
+
]
|
| 89 |
+
)
|
| 90 |
+
# if samples are not enough for one batch, we don't drop last
|
| 91 |
+
if self.type == "valid" and num_of_batches < 1:
|
| 92 |
+
return len(self.concat_dataset)
|
| 93 |
+
return num_of_batches * self.batch_size
|
| 94 |
+
|
| 95 |
+
def __iter__(self):
|
| 96 |
+
iters = []
|
| 97 |
+
for dataset in self.concat_dataset.datasets:
|
| 98 |
+
iters.append(
|
| 99 |
+
SequentialSampler(dataset).__iter__()
|
| 100 |
+
if not self.holistic_shuffle
|
| 101 |
+
else RandomSampler(dataset).__iter__()
|
| 102 |
+
)
|
| 103 |
+
# e.g. [0, 200, 400]
|
| 104 |
+
init_indices = [0] + self.concat_dataset.cumulative_sizes[:-1]
|
| 105 |
+
output_batches = []
|
| 106 |
+
for dataset_idx in range(len(self.concat_dataset.datasets)):
|
| 107 |
+
cur_batch = []
|
| 108 |
+
for idx in iters[dataset_idx]:
|
| 109 |
+
cur_batch.append(idx + init_indices[dataset_idx])
|
| 110 |
+
if len(cur_batch) == self.batch_size:
|
| 111 |
+
output_batches.append(cur_batch)
|
| 112 |
+
cur_batch = []
|
| 113 |
+
# if loader_type is valid, we don't need to drop last
|
| 114 |
+
if self.type == "valid" and len(cur_batch) > 0:
|
| 115 |
+
output_batches.append(cur_batch)
|
| 116 |
+
|
| 117 |
+
# force drop last in training
|
| 118 |
+
random.shuffle(output_batches)
|
| 119 |
+
output_indices = [item for sublist in output_batches for item in sublist]
|
| 120 |
+
return iter(output_indices)
|
| 121 |
+
|
| 122 |
+
|
| 123 |
+
def build_samplers(concat_dataset: Dataset, cfg, logger, loader_type):
|
| 124 |
+
sampler = ScheduledSampler(
|
| 125 |
+
concat_dataset,
|
| 126 |
+
cfg.train.batch_size,
|
| 127 |
+
cfg.train.sampler.holistic_shuffle,
|
| 128 |
+
logger,
|
| 129 |
+
loader_type,
|
| 130 |
+
)
|
| 131 |
+
batch_sampler = BatchSampler(
|
| 132 |
+
sampler,
|
| 133 |
+
cfg.train.batch_size,
|
| 134 |
+
cfg.train.sampler.drop_last if not loader_type == "valid" else False,
|
| 135 |
+
)
|
| 136 |
+
return sampler, batch_sampler
|
models/base/base_trainer.py
ADDED
|
@@ -0,0 +1,348 @@
| 1 |
+
# Copyright (c) 2023 Amphion.
|
| 2 |
+
#
|
| 3 |
+
# This source code is licensed under the MIT license found in the
|
| 4 |
+
# LICENSE file in the root directory of this source tree.
|
| 5 |
+
|
| 6 |
+
import collections
|
| 7 |
+
import json
|
| 8 |
+
import os
|
| 9 |
+
import sys
|
| 10 |
+
import time
|
| 11 |
+
|
| 12 |
+
import torch
|
| 13 |
+
import torch.distributed as dist
|
| 14 |
+
from torch.nn.parallel import DistributedDataParallel
|
| 15 |
+
from torch.utils.data import ConcatDataset, DataLoader
|
| 16 |
+
from torch.utils.tensorboard import SummaryWriter
|
| 17 |
+
|
| 18 |
+
from models.base.base_sampler import BatchSampler
|
| 19 |
+
from utils.util import (
|
| 20 |
+
Logger,
|
| 21 |
+
remove_older_ckpt,
|
| 22 |
+
save_config,
|
| 23 |
+
set_all_random_seed,
|
| 24 |
+
ValueWindow,
|
| 25 |
+
)
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
class BaseTrainer(object):
|
| 29 |
+
def __init__(self, args, cfg):
|
| 30 |
+
self.args = args
|
| 31 |
+
self.log_dir = args.log_dir
|
| 32 |
+
self.cfg = cfg
|
| 33 |
+
|
| 34 |
+
self.checkpoint_dir = os.path.join(args.log_dir, "checkpoints")
|
| 35 |
+
os.makedirs(self.checkpoint_dir, exist_ok=True)
|
| 36 |
+
if not cfg.train.ddp or args.local_rank == 0:
|
| 37 |
+
self.sw = SummaryWriter(os.path.join(args.log_dir, "events"))
|
| 38 |
+
self.logger = self.build_logger()
|
| 39 |
+
self.time_window = ValueWindow(50)
|
| 40 |
+
|
| 41 |
+
self.step = 0
|
| 42 |
+
self.epoch = -1
|
| 43 |
+
self.max_epochs = self.cfg.train.epochs
|
| 44 |
+
self.max_steps = self.cfg.train.max_steps
|
| 45 |
+
|
| 46 |
+
# set random seed & init distributed training
|
| 47 |
+
set_all_random_seed(self.cfg.train.random_seed)
|
| 48 |
+
if cfg.train.ddp:
|
| 49 |
+
dist.init_process_group(backend="nccl")
|
| 50 |
+
|
| 51 |
+
if cfg.model_type not in ["AutoencoderKL", "AudioLDM"]:
|
| 52 |
+
self.singers = self.build_singers_lut()
|
| 53 |
+
|
| 54 |
+
# setup data_loader
|
| 55 |
+
self.data_loader = self.build_data_loader()
|
| 56 |
+
|
| 57 |
+
# setup model & enable distributed training
|
| 58 |
+
self.model = self.build_model()
|
| 59 |
+
print(self.model)
|
| 60 |
+
|
| 61 |
+
if isinstance(self.model, dict):
|
| 62 |
+
for key, value in self.model.items():
|
| 63 |
+
value.cuda(self.args.local_rank)
|
| 64 |
+
if key == "PQMF":
|
| 65 |
+
continue
|
| 66 |
+
if cfg.train.ddp:
|
| 67 |
+
self.model[key] = DistributedDataParallel(
|
| 68 |
+
value, device_ids=[self.args.local_rank]
|
| 69 |
+
)
|
| 70 |
+
else:
|
| 71 |
+
self.model.cuda(self.args.local_rank)
|
| 72 |
+
if cfg.train.ddp:
|
| 73 |
+
self.model = DistributedDataParallel(
|
| 74 |
+
self.model, device_ids=[self.args.local_rank]
|
| 75 |
+
)
|
| 76 |
+
|
| 77 |
+
# create criterion
|
| 78 |
+
self.criterion = self.build_criterion()
|
| 79 |
+
if isinstance(self.criterion, dict):
|
| 80 |
+
for key, value in self.criterion.items():
|
| 81 |
+
self.criterion[key].cuda(args.local_rank)
|
| 82 |
+
else:
|
| 83 |
+
self.criterion.cuda(self.args.local_rank)
|
| 84 |
+
|
| 85 |
+
# optimizer
|
| 86 |
+
self.optimizer = self.build_optimizer()
|
| 87 |
+
self.scheduler = self.build_scheduler()
|
| 88 |
+
|
| 89 |
+
# save config file
|
| 90 |
+
self.config_save_path = os.path.join(self.checkpoint_dir, "args.json")
|
| 91 |
+
|
| 92 |
+
def build_logger(self):
|
| 93 |
+
log_file = os.path.join(self.checkpoint_dir, "train.log")
|
| 94 |
+
logger = Logger(log_file, level=self.args.log_level).logger
|
| 95 |
+
|
| 96 |
+
return logger
|
| 97 |
+
|
| 98 |
+
def build_dataset(self):
|
| 99 |
+
raise NotImplementedError
|
| 100 |
+
|
| 101 |
+
def build_data_loader(self):
|
| 102 |
+
Dataset, Collator = self.build_dataset()
|
| 103 |
+
# build dataset instance for each dataset and combine them by ConcatDataset
|
| 104 |
+
datasets_list = []
|
| 105 |
+
for dataset in self.cfg.dataset:
|
| 106 |
+
subdataset = Dataset(self.cfg, dataset, is_valid=False)
|
| 107 |
+
datasets_list.append(subdataset)
|
| 108 |
+
train_dataset = ConcatDataset(datasets_list)
|
| 109 |
+
|
| 110 |
+
train_collate = Collator(self.cfg)
|
| 111 |
+
# TODO: multi-GPU training
|
| 112 |
+
if self.cfg.train.ddp:
|
| 113 |
+
raise NotImplementedError("DDP is not supported yet.")
|
| 114 |
+
|
| 115 |
+
# sampler will provide indices to batch_sampler, which will perform batching and yield batch indices
|
| 116 |
+
batch_sampler = BatchSampler(
|
| 117 |
+
cfg=self.cfg, concat_dataset=train_dataset, dataset_list=datasets_list
|
| 118 |
+
)
|
| 119 |
+
|
| 120 |
+
# use batch_sampler argument instead of (sampler, shuffle, drop_last, batch_size)
|
| 121 |
+
train_loader = DataLoader(
|
| 122 |
+
train_dataset,
|
| 123 |
+
collate_fn=train_collate,
|
| 124 |
+
num_workers=self.args.num_workers,
|
| 125 |
+
batch_sampler=batch_sampler,
|
| 126 |
+
pin_memory=False,
|
| 127 |
+
)
|
| 128 |
+
if not self.cfg.train.ddp or self.args.local_rank == 0:
|
| 129 |
+
datasets_list = []
|
| 130 |
+
for dataset in self.cfg.dataset:
|
| 131 |
+
subdataset = Dataset(self.cfg, dataset, is_valid=True)
|
| 132 |
+
datasets_list.append(subdataset)
|
| 133 |
+
valid_dataset = ConcatDataset(datasets_list)
|
| 134 |
+
valid_collate = Collator(self.cfg)
|
| 135 |
+
batch_sampler = BatchSampler(
|
| 136 |
+
cfg=self.cfg, concat_dataset=valid_dataset, dataset_list=datasets_list
|
| 137 |
+
)
|
| 138 |
+
valid_loader = DataLoader(
|
| 139 |
+
valid_dataset,
|
| 140 |
+
collate_fn=valid_collate,
|
| 141 |
+
num_workers=1,
|
| 142 |
+
batch_sampler=batch_sampler,
|
| 143 |
+
)
|
| 144 |
+
else:
|
| 145 |
+
raise NotImplementedError("DDP is not supported yet.")
|
| 146 |
+
# valid_loader = None
|
| 147 |
+
data_loader = {"train": train_loader, "valid": valid_loader}
|
| 148 |
+
return data_loader
|
| 149 |
+
|
| 150 |
+
def build_singers_lut(self):
|
| 151 |
+
# combine singers
|
| 152 |
+
if not os.path.exists(os.path.join(self.log_dir, self.cfg.preprocess.spk2id)):
|
| 153 |
+
singers = collections.OrderedDict()
|
| 154 |
+
else:
|
| 155 |
+
with open(
|
| 156 |
+
os.path.join(self.log_dir, self.cfg.preprocess.spk2id), "r"
|
| 157 |
+
) as singer_file:
|
| 158 |
+
singers = json.load(singer_file)
|
| 159 |
+
singer_count = len(singers)
|
| 160 |
+
for dataset in self.cfg.dataset:
|
            singer_lut_path = os.path.join(
                self.cfg.preprocess.processed_dir, dataset, self.cfg.preprocess.spk2id
            )
            with open(singer_lut_path, "r") as singer_lut_path:
                singer_lut = json.load(singer_lut_path)
            for singer in singer_lut.keys():
                if singer not in singers:
                    singers[singer] = singer_count
                    singer_count += 1
        with open(
            os.path.join(self.log_dir, self.cfg.preprocess.spk2id), "w"
        ) as singer_file:
            json.dump(singers, singer_file, indent=4, ensure_ascii=False)
        print(
            "singers have been dumped to {}".format(
                os.path.join(self.log_dir, self.cfg.preprocess.spk2id)
            )
        )
        return singers

    def build_model(self):
        raise NotImplementedError()

    def build_optimizer(self):
        raise NotImplementedError

    def build_scheduler(self):
        raise NotImplementedError()

    def build_criterion(self):
        raise NotImplementedError

    def get_state_dict(self):
        raise NotImplementedError

    def save_config_file(self):
        save_config(self.config_save_path, self.cfg)

    # TODO, save without module.
    def save_checkpoint(self, state_dict, saved_model_path):
        torch.save(state_dict, saved_model_path)

    def load_checkpoint(self):
        checkpoint_path = os.path.join(self.checkpoint_dir, "checkpoint")
        assert os.path.exists(checkpoint_path)
        checkpoint_filename = open(checkpoint_path).readlines()[-1].strip()
        model_path = os.path.join(self.checkpoint_dir, checkpoint_filename)
        assert os.path.exists(model_path)
        if not self.cfg.train.ddp or self.args.local_rank == 0:
            self.logger.info(f"Re(store) from {model_path}")
        checkpoint = torch.load(model_path, map_location="cpu")
        return checkpoint

    def load_model(self, checkpoint):
        raise NotImplementedError

    def restore(self):
        checkpoint = self.load_checkpoint()
        self.load_model(checkpoint)

    def train_step(self, data):
        raise NotImplementedError(
            f"Need to implement function {sys._getframe().f_code.co_name} in "
            f"your sub-class of {self.__class__.__name__}. "
        )

    @torch.no_grad()
    def eval_step(self):
        raise NotImplementedError(
            f"Need to implement function {sys._getframe().f_code.co_name} in "
            f"your sub-class of {self.__class__.__name__}. "
        )

    def write_summary(self, losses, stats):
        raise NotImplementedError(
            f"Need to implement function {sys._getframe().f_code.co_name} in "
            f"your sub-class of {self.__class__.__name__}. "
        )

    def write_valid_summary(self, losses, stats):
        raise NotImplementedError(
            f"Need to implement function {sys._getframe().f_code.co_name} in "
            f"your sub-class of {self.__class__.__name__}. "
        )

    def echo_log(self, losses, mode="Training"):
        message = [
            "{} - Epoch {} Step {}: [{:.3f} s/step]".format(
                mode, self.epoch + 1, self.step, self.time_window.average
            )
        ]

        for key in sorted(losses.keys()):
            if isinstance(losses[key], dict):
                for k, v in losses[key].items():
                    message.append(
                        str(k).split("/")[-1] + "=" + str(round(float(v), 5))
                    )
            else:
                message.append(
                    str(key).split("/")[-1] + "=" + str(round(float(losses[key]), 5))
                )
        self.logger.info(", ".join(message))

    def eval_epoch(self):
        self.logger.info("Validation...")
        valid_losses = {}
        for i, batch_data in enumerate(self.data_loader["valid"]):
            for k, v in batch_data.items():
                if isinstance(v, torch.Tensor):
                    batch_data[k] = v.cuda()
            valid_loss, valid_stats, total_valid_loss = self.eval_step(batch_data, i)
            for key in valid_loss:
                if key not in valid_losses:
                    valid_losses[key] = 0
                valid_losses[key] += valid_loss[key]

        # Add mel and audio to the Tensorboard
        # Average loss
        for key in valid_losses:
            valid_losses[key] /= i + 1
        self.echo_log(valid_losses, "Valid")
        return valid_losses, valid_stats

    def train_epoch(self):
        for i, batch_data in enumerate(self.data_loader["train"]):
            start_time = time.time()
            # Put the data to cuda device
            for k, v in batch_data.items():
                if isinstance(v, torch.Tensor):
                    batch_data[k] = v.cuda(self.args.local_rank)

            # Training step
            train_losses, train_stats, total_loss = self.train_step(batch_data)
            self.time_window.append(time.time() - start_time)

            if self.args.local_rank == 0 or not self.cfg.train.ddp:
                if self.step % self.args.stdout_interval == 0:
                    self.echo_log(train_losses, "Training")

                if self.step % self.cfg.train.save_summary_steps == 0:
                    self.logger.info(f"Save summary as step {self.step}")
                    self.write_summary(train_losses, train_stats)

                if (
                    self.step % self.cfg.train.save_checkpoints_steps == 0
                    and self.step != 0
                ):
                    saved_model_name = "step-{:07d}_loss-{:.4f}.pt".format(
                        self.step, total_loss
                    )
                    saved_model_path = os.path.join(
                        self.checkpoint_dir, saved_model_name
                    )
                    saved_state_dict = self.get_state_dict()
                    self.save_checkpoint(saved_state_dict, saved_model_path)
                    self.save_config_file()
                    # keep max n models
                    remove_older_ckpt(
                        saved_model_name,
                        self.checkpoint_dir,
                        max_to_keep=self.cfg.train.keep_checkpoint_max,
                    )

                if self.step != 0 and self.step % self.cfg.train.valid_interval == 0:
                    if isinstance(self.model, dict):
                        for key in self.model.keys():
                            self.model[key].eval()
                    else:
                        self.model.eval()
                    # Evaluate one epoch and get average loss
                    valid_losses, valid_stats = self.eval_epoch()
                    if isinstance(self.model, dict):
                        for key in self.model.keys():
                            self.model[key].train()
                    else:
                        self.model.train()
                    # Write validation losses to summary.
                    self.write_valid_summary(valid_losses, valid_stats)
            self.step += 1

    def train(self):
        for epoch in range(max(0, self.epoch), self.max_epochs):
            self.train_epoch()
            self.epoch += 1
            if self.step > self.max_steps:
                self.logger.info("Training finished!")
                break
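Note (not part of the diff): the legacy trainer above only provides the training/validation loops; concrete models fill in the build_* hooks and per-batch steps. The following is a minimal illustrative sketch of such a subclass, assuming the base __init__ has already assigned self.model, self.optimizer, and self.criterion from these hooks; the toy linear model, MSE loss, and the `adam` config section are placeholders, not Amphion code.

import torch
import torch.nn as nn


class ToyTrainer(BaseTrainer):  # BaseTrainer as defined above
    def build_model(self):
        # Placeholder model; a real trainer would build its acoustic model from self.cfg.model.
        return nn.Linear(80, 80)

    def build_criterion(self):
        return nn.MSELoss()

    def build_optimizer(self):
        # Assumes an `adam` section exists in the training config (hypothetical).
        return torch.optim.Adam(self.model.parameters(), **self.cfg.train.adam)

    def get_state_dict(self):
        return {"model": self.model.state_dict(), "step": self.step}

    def train_step(self, data):
        pred = self.model(data["mel"])
        loss = self.criterion(pred, data["mel"])
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # train_epoch() expects (losses_dict, stats_dict, total_loss)
        return {"mse": loss.item()}, {}, loss.item()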

models/base/new_dataset.py
ADDED
@@ -0,0 +1,50 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import json
import os
from abc import abstractmethod
from pathlib import Path

import json5
import torch
import yaml


# TODO: for training and validating
class BaseDataset(torch.utils.data.Dataset):
    r"""Base dataset for training and validating."""

    def __init__(self, args, cfg, is_valid=False):
        pass


class BaseTestDataset(torch.utils.data.Dataset):
    r"""Test dataset for inference."""

    def __init__(self, args=None, cfg=None, infer_type="from_dataset"):
        assert infer_type in ["from_dataset", "from_file"]

        self.args = args
        self.cfg = cfg
        self.infer_type = infer_type

    @abstractmethod
    def __getitem__(self, index):
        pass

    def __len__(self):
        return len(self.metadata)

    def get_metadata(self):
        path = Path(self.args.source)
        if path.suffix == ".json" or path.suffix == ".jsonc":
            metadata = json5.load(open(self.args.source, "r"))
        elif path.suffix == ".yaml" or path.suffix == ".yml":
            metadata = yaml.full_load(open(self.args.source, "r"))
        else:
            raise ValueError(f"Unsupported file type: {path.suffix}")

        return metadata
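For reference (not part of the diff): get_metadata() above only assumes that `args.source` points at a JSON/JSON5 or YAML file that parses into the dataset's metadata list. A hypothetical "from_file" source could look like the Python literal below; the keys mirror the Dataset/Uid convention used elsewhere in this commit, while the concrete values and paths are made up.

# Hypothetical content of an `args.source` file for infer_type="from_file".
example_metadata = [
    {"Dataset": "demo", "Uid": "utt_0001", "Path": "/path/to/utt_0001.wav"},
    {"Dataset": "demo", "Uid": "utt_0002", "Path": "/path/to/utt_0002.wav"},
]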

models/base/new_inference.py
ADDED
@@ -0,0 +1,249 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import os
import random
import re
import time
from abc import abstractmethod
from pathlib import Path

import accelerate
import json5
import numpy as np
import torch
from accelerate.logging import get_logger
from torch.utils.data import DataLoader

from models.vocoders.vocoder_inference import synthesis
from utils.io import save_audio
from utils.util import load_config
from utils.audio_slicer import is_silence

EPS = 1.0e-12


class BaseInference(object):
    def __init__(self, args=None, cfg=None, infer_type="from_dataset"):
        super().__init__()

        start = time.monotonic_ns()
        self.args = args
        self.cfg = cfg

        assert infer_type in ["from_dataset", "from_file"]
        self.infer_type = infer_type

        # init with accelerate
        self.accelerator = accelerate.Accelerator()
        self.accelerator.wait_for_everyone()

        # Use accelerate logger for distributed inference
        with self.accelerator.main_process_first():
            self.logger = get_logger("inference", log_level=args.log_level)

        # Log some info
        self.logger.info("=" * 56)
        self.logger.info("||\t\t" + "New inference process started." + "\t\t||")
        self.logger.info("=" * 56)
        self.logger.info("\n")
        self.logger.debug(f"Using {args.log_level.upper()} logging level.")

        self.acoustics_dir = args.acoustics_dir
        self.logger.debug(f"Acoustic dir: {args.acoustics_dir}")
        self.vocoder_dir = args.vocoder_dir
        self.logger.debug(f"Vocoder dir: {args.vocoder_dir}")
        # should be in svc inferencer
        # self.target_singer = args.target_singer
        # self.logger.info(f"Target singers: {args.target_singer}")
        # self.trans_key = args.trans_key
        # self.logger.info(f"Trans key: {args.trans_key}")

        os.makedirs(args.output_dir, exist_ok=True)

        # set random seed
        with self.accelerator.main_process_first():
            start = time.monotonic_ns()
            self._set_random_seed(self.cfg.train.random_seed)
            end = time.monotonic_ns()
            self.logger.debug(
                f"Setting random seed done in {(end - start) / 1e6:.2f}ms"
            )
            self.logger.debug(f"Random seed: {self.cfg.train.random_seed}")

        # setup data_loader
        with self.accelerator.main_process_first():
            self.logger.info("Building dataset...")
            start = time.monotonic_ns()
            self.test_dataloader = self._build_dataloader()
            end = time.monotonic_ns()
            self.logger.info(f"Building dataset done in {(end - start) / 1e6:.2f}ms")

        # setup model
        with self.accelerator.main_process_first():
            self.logger.info("Building model...")
            start = time.monotonic_ns()
            self.model = self._build_model()
            end = time.monotonic_ns()
            # self.logger.debug(self.model)
            self.logger.info(f"Building model done in {(end - start) / 1e6:.3f}ms")

        # init with accelerate
        self.logger.info("Initializing accelerate...")
        start = time.monotonic_ns()
        self.accelerator = accelerate.Accelerator()
        self.model = self.accelerator.prepare(self.model)
        end = time.monotonic_ns()
        self.accelerator.wait_for_everyone()
        self.logger.info(f"Initializing accelerate done in {(end - start) / 1e6:.3f}ms")

        with self.accelerator.main_process_first():
            self.logger.info("Loading checkpoint...")
            start = time.monotonic_ns()
            # TODO: Also, suppose only use latest one yet
            self.__load_model(os.path.join(args.acoustics_dir, "checkpoint"))
            end = time.monotonic_ns()
            self.logger.info(f"Loading checkpoint done in {(end - start) / 1e6:.3f}ms")

        self.model.eval()
        self.accelerator.wait_for_everyone()

    ### Abstract methods ###
    @abstractmethod
    def _build_test_dataset(self):
        pass

    @abstractmethod
    def _build_model(self):
        pass

    @abstractmethod
    @torch.inference_mode()
    def _inference_each_batch(self, batch_data):
        pass

    ### Abstract methods end ###

    @torch.inference_mode()
    def inference(self):
        for i, batch in enumerate(self.test_dataloader):
            y_pred = self._inference_each_batch(batch).cpu()
            mel_min, mel_max = self.test_dataset.target_mel_extrema
            y_pred = (y_pred + 1.0) / 2.0 * (mel_max - mel_min + EPS) + mel_min
            y_ls = y_pred.chunk(self.test_batch_size)
            tgt_ls = batch["target_len"].cpu().chunk(self.test_batch_size)
            j = 0
            for it, l in zip(y_ls, tgt_ls):
                l = l.item()
                it = it.squeeze(0)[:l]
                uid = self.test_dataset.metadata[i * self.test_batch_size + j]["Uid"]
                torch.save(it, os.path.join(self.args.output_dir, f"{uid}.pt"))
                j += 1

        vocoder_cfg, vocoder_ckpt = self._parse_vocoder(self.args.vocoder_dir)

        res = synthesis(
            cfg=vocoder_cfg,
            vocoder_weight_file=vocoder_ckpt,
            n_samples=None,
            pred=[
                torch.load(
                    os.path.join(self.args.output_dir, "{}.pt".format(i["Uid"]))
                ).numpy(force=True)
                for i in self.test_dataset.metadata
            ],
        )

        output_audio_files = []
        for it, wav in zip(self.test_dataset.metadata, res):
            uid = it["Uid"]
            file = os.path.join(self.args.output_dir, f"{uid}.wav")
            output_audio_files.append(file)

            wav = wav.numpy(force=True)
            save_audio(
                file,
                wav,
                self.cfg.preprocess.sample_rate,
                add_silence=False,
                turn_up=not is_silence(wav, self.cfg.preprocess.sample_rate),
            )
            os.remove(os.path.join(self.args.output_dir, f"{uid}.pt"))

        return sorted(output_audio_files)

    # TODO: LEGACY CODE
    def _build_dataloader(self):
        datasets, collate = self._build_test_dataset()
        self.test_dataset = datasets(self.args, self.cfg, self.infer_type)
        self.test_collate = collate(self.cfg)
        self.test_batch_size = min(
            self.cfg.train.batch_size, len(self.test_dataset.metadata)
        )
        test_dataloader = DataLoader(
            self.test_dataset,
            collate_fn=self.test_collate,
            num_workers=1,
            batch_size=self.test_batch_size,
            shuffle=False,
        )
        return test_dataloader

    def __load_model(self, checkpoint_dir: str = None, checkpoint_path: str = None):
        r"""Load model from checkpoint. If checkpoint_path is None, it will
        load the latest checkpoint in checkpoint_dir. If checkpoint_path is not
        None, it will load the checkpoint specified by checkpoint_path. **Only use this
        method after** ``accelerator.prepare()``.
        """
        if checkpoint_path is None:
            ls = []
            for i in Path(checkpoint_dir).iterdir():
                if re.match(r"epoch-\d+_step-\d+_loss-[\d.]+", str(i.stem)):
                    ls.append(i)
            ls.sort(
                key=lambda x: int(x.stem.split("_")[-3].split("-")[-1]), reverse=True
            )
            checkpoint_path = ls[0]
        else:
            checkpoint_path = Path(checkpoint_path)
        self.accelerator.load_state(str(checkpoint_path))
        # set epoch and step
        self.epoch = int(checkpoint_path.stem.split("_")[-3].split("-")[-1])
        self.step = int(checkpoint_path.stem.split("_")[-2].split("-")[-1])
        return str(checkpoint_path)

    @staticmethod
    def _set_random_seed(seed):
        r"""Set random seed for all possible random modules."""
        random.seed(seed)
        np.random.seed(seed)
        torch.random.manual_seed(seed)

    @staticmethod
    def _parse_vocoder(vocoder_dir):
        r"""Parse vocoder config"""
        vocoder_dir = os.path.abspath(vocoder_dir)
        ckpt_list = [ckpt for ckpt in Path(vocoder_dir).glob("*.pt")]
        ckpt_list.sort(key=lambda x: int(x.stem), reverse=True)
        ckpt_path = str(ckpt_list[0])
        vocoder_cfg = load_config(
            os.path.join(vocoder_dir, "args.json"), lowercase=True
        )
        return vocoder_cfg, ckpt_path

    @staticmethod
    def __count_parameters(model):
        return sum(p.numel() for p in model.parameters())

    def __dump_cfg(self, path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        json5.dump(
            self.cfg,
            open(path, "w"),
            indent=4,
            sort_keys=True,
            ensure_ascii=False,
            quote_keys=True,
        )
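For orientation only (not part of the diff): a concrete inferencer built on BaseInference above supplies the three abstract methods. The sketch below is an assumption-laden outline; TestDataset, TestCollator, and AcousticModel are placeholder names, not Amphion classes. Note that inference() expects _inference_each_batch to return mel predictions normalized to [-1, 1], which it rescales with the dataset's target_mel_extrema before vocoding.

class SketchInference(BaseInference):  # BaseInference as defined above
    def _build_test_dataset(self):
        # Return the (Dataset, Collator) classes; _build_dataloader() instantiates them.
        return TestDataset, TestCollator  # placeholders

    def _build_model(self):
        return AcousticModel(self.cfg.model)  # placeholder

    @torch.inference_mode()
    def _inference_each_batch(self, batch_data):
        # Predictions are expected in the normalized [-1, 1] mel range;
        # inference() maps them back using self.test_dataset.target_mel_extrema.
        return self.model(batch_data)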

models/base/new_trainer.py
ADDED
@@ -0,0 +1,722 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import json
import os
import random
import shutil
import time
from abc import abstractmethod
from pathlib import Path

import accelerate
import json5
import numpy as np
import torch
from accelerate.logging import get_logger
from accelerate.utils import ProjectConfiguration
from torch.utils.data import ConcatDataset, DataLoader
from tqdm import tqdm

from models.base.base_sampler import build_samplers
from optimizer.optimizers import NoamLR


class BaseTrainer(object):
    r"""The base trainer for all tasks. Any trainer should inherit from this class."""

    def __init__(self, args=None, cfg=None):
        super().__init__()

        self.args = args
        self.cfg = cfg

        cfg.exp_name = args.exp_name

        # init with accelerate
        self._init_accelerator()
        self.accelerator.wait_for_everyone()

        # Use accelerate logger for distributed training
        with self.accelerator.main_process_first():
            self.logger = get_logger(args.exp_name, log_level=args.log_level)

        # Log some info
        self.logger.info("=" * 56)
        self.logger.info("||\t\t" + "New training process started." + "\t\t||")
        self.logger.info("=" * 56)
        self.logger.info("\n")
        self.logger.debug(f"Using {args.log_level.upper()} logging level.")
        self.logger.info(f"Experiment name: {args.exp_name}")
        self.logger.info(f"Experiment directory: {self.exp_dir}")
        self.checkpoint_dir = os.path.join(self.exp_dir, "checkpoint")
        if self.accelerator.is_main_process:
            os.makedirs(self.checkpoint_dir, exist_ok=True)
        self.logger.debug(f"Checkpoint directory: {self.checkpoint_dir}")

        # init counts
        self.batch_count: int = 0
        self.step: int = 0
        self.epoch: int = 0
        self.max_epoch = (
            self.cfg.train.max_epoch if self.cfg.train.max_epoch > 0 else float("inf")
        )
        self.logger.info(
            "Max epoch: {}".format(
                self.max_epoch if self.max_epoch < float("inf") else "Unlimited"
            )
        )

        # Check values
        if self.accelerator.is_main_process:
            self.__check_basic_configs()
            # Set runtime configs
            self.save_checkpoint_stride = self.cfg.train.save_checkpoint_stride
            self.checkpoints_path = [
                [] for _ in range(len(self.save_checkpoint_stride))
            ]
            self.keep_last = [
                i if i > 0 else float("inf") for i in self.cfg.train.keep_last
            ]
            self.run_eval = self.cfg.train.run_eval

        # set random seed
        with self.accelerator.main_process_first():
            start = time.monotonic_ns()
            self._set_random_seed(self.cfg.train.random_seed)
            end = time.monotonic_ns()
            self.logger.debug(
                f"Setting random seed done in {(end - start) / 1e6:.2f}ms"
            )
            self.logger.debug(f"Random seed: {self.cfg.train.random_seed}")

        # setup data_loader
        with self.accelerator.main_process_first():
            self.logger.info("Building dataset...")
            start = time.monotonic_ns()
            self.train_dataloader, self.valid_dataloader = self._build_dataloader()
            end = time.monotonic_ns()
            self.logger.info(f"Building dataset done in {(end - start) / 1e6:.2f}ms")

        # setup model
        with self.accelerator.main_process_first():
            self.logger.info("Building model...")
            start = time.monotonic_ns()
            self.model = self._build_model()
            end = time.monotonic_ns()
            self.logger.debug(self.model)
            self.logger.info(f"Building model done in {(end - start) / 1e6:.2f}ms")
            self.logger.info(
                f"Model parameters: {self.__count_parameters(self.model)/1e6:.2f}M"
            )
        # optimizer & scheduler
        with self.accelerator.main_process_first():
            self.logger.info("Building optimizer and scheduler...")
            start = time.monotonic_ns()
            self.optimizer = self.__build_optimizer()
            self.scheduler = self.__build_scheduler()
            end = time.monotonic_ns()
            self.logger.info(
                f"Building optimizer and scheduler done in {(end - start) / 1e6:.2f}ms"
            )

        # accelerate prepare
        self.logger.info("Initializing accelerate...")
        start = time.monotonic_ns()
        (
            self.train_dataloader,
            self.valid_dataloader,
            self.model,
            self.optimizer,
            self.scheduler,
        ) = self.accelerator.prepare(
            self.train_dataloader,
            self.valid_dataloader,
            self.model,
            self.optimizer,
            self.scheduler,
        )
        end = time.monotonic_ns()
        self.logger.info(f"Initializing accelerate done in {(end - start) / 1e6:.2f}ms")

        # create criterion
        with self.accelerator.main_process_first():
            self.logger.info("Building criterion...")
            start = time.monotonic_ns()
            self.criterion = self._build_criterion()
            end = time.monotonic_ns()
            self.logger.info(f"Building criterion done in {(end - start) / 1e6:.2f}ms")

        # Resume or Finetune
        with self.accelerator.main_process_first():
            if args.resume:
                ## Automatically resume according to the current exprimental name
                self.logger.info("Resuming from {}...".format(self.checkpoint_dir))
                start = time.monotonic_ns()
                ckpt_path = self.__load_model(
                    checkpoint_dir=self.checkpoint_dir, resume_type=args.resume_type
                )
                end = time.monotonic_ns()
                self.logger.info(
                    f"Resuming from checkpoint done in {(end - start) / 1e6:.2f}ms"
                )
                self.checkpoints_path = json.load(
                    open(os.path.join(ckpt_path, "ckpts.json"), "r")
                )
            elif args.resume_from_ckpt_path and args.resume_from_ckpt_path != "":
                ## Resume from the given checkpoint path
                if not os.path.exists(args.resume_from_ckpt_path):
                    raise ValueError(
                        "[Error] The resumed checkpoint path {} don't exist.".format(
                            args.resume_from_ckpt_path
                        )
                    )

                self.logger.info(
                    "Resuming from {}...".format(args.resume_from_ckpt_path)
                )
                start = time.monotonic_ns()
                ckpt_path = self.__load_model(
                    checkpoint_path=args.resume_from_ckpt_path,
                    resume_type=args.resume_type,
                )
                end = time.monotonic_ns()
                self.logger.info(
                    f"Resuming from checkpoint done in {(end - start) / 1e6:.2f}ms"
                )

        # save config file path
        self.config_save_path = os.path.join(self.exp_dir, "args.json")

    ### Following are abstract methods that should be implemented in child classes ###
    @abstractmethod
    def _build_dataset(self):
        r"""Build dataset for model training/validating/evaluating."""
        pass

    @staticmethod
    @abstractmethod
    def _build_criterion():
        r"""Build criterion function for model loss calculation."""
        pass

    @abstractmethod
    def _build_model(self):
        r"""Build model for training/validating/evaluating."""
        pass

    @abstractmethod
    def _forward_step(self, batch):
        r"""One forward step of the neural network. This abstract method is trying to
        unify ``_train_step`` and ``_valid_step`` and avoid redundant implementation.
        However, for special case that using different forward step pattern for
        training and validating, you could just override this method with ``pass`` and
        implement ``_train_step`` and ``_valid_step`` separately.
        """
        pass

    @abstractmethod
    def _save_auxiliary_states(self):
        r"""To save some auxiliary states when saving model's ckpt"""
        pass

    ### Abstract methods end ###

    ### THIS IS MAIN ENTRY ###
    def train_loop(self):
        r"""Training loop. The public entry of training process."""
        # Wait everyone to prepare before we move on
        self.accelerator.wait_for_everyone()
        # dump config file
        if self.accelerator.is_main_process:
            self.__dump_cfg(self.config_save_path)
        self.model.train()
        self.optimizer.zero_grad()
        # Wait to ensure good to go
        self.accelerator.wait_for_everyone()
        while self.epoch < self.max_epoch:
            self.logger.info("\n")
            self.logger.info("-" * 32)
            self.logger.info("Epoch {}: ".format(self.epoch))

            ### TODO: change the return values of _train_epoch() to a loss dict, or (total_loss, loss_dict)
            ### It's inconvenient for the model with multiple losses
            # Do training & validating epoch
            train_loss = self._train_epoch()
            self.logger.info("  |- Train/Loss: {:.6f}".format(train_loss))
            valid_loss = self._valid_epoch()
            self.logger.info("  |- Valid/Loss: {:.6f}".format(valid_loss))
            self.accelerator.log(
                {"Epoch/Train Loss": train_loss, "Epoch/Valid Loss": valid_loss},
                step=self.epoch,
            )

            self.accelerator.wait_for_everyone()
            # TODO: what is scheduler?
            self.scheduler.step(valid_loss)  # FIXME: use epoch track correct?

            # Check if hit save_checkpoint_stride and run_eval
            run_eval = False
            if self.accelerator.is_main_process:
                save_checkpoint = False
                hit_dix = []
                for i, num in enumerate(self.save_checkpoint_stride):
                    if self.epoch % num == 0:
                        save_checkpoint = True
                        hit_dix.append(i)
                        run_eval |= self.run_eval[i]

            self.accelerator.wait_for_everyone()
            if self.accelerator.is_main_process and save_checkpoint:
                path = os.path.join(
                    self.checkpoint_dir,
                    "epoch-{:04d}_step-{:07d}_loss-{:.6f}".format(
                        self.epoch, self.step, train_loss
                    ),
                )
                self.tmp_checkpoint_save_path = path
                self.accelerator.save_state(path)
                print(f"save checkpoint in {path}")
                json.dump(
                    self.checkpoints_path,
                    open(os.path.join(path, "ckpts.json"), "w"),
                    ensure_ascii=False,
                    indent=4,
                )
                self._save_auxiliary_states()

                # Remove old checkpoints
                to_remove = []
                for idx in hit_dix:
                    self.checkpoints_path[idx].append(path)
                    while len(self.checkpoints_path[idx]) > self.keep_last[idx]:
                        to_remove.append((idx, self.checkpoints_path[idx].pop(0)))

                # Search conflicts
                total = set()
                for i in self.checkpoints_path:
                    total |= set(i)
                do_remove = set()
                for idx, path in to_remove[::-1]:
                    if path in total:
                        self.checkpoints_path[idx].insert(0, path)
                    else:
                        do_remove.add(path)

                # Remove old checkpoints
                for path in do_remove:
                    shutil.rmtree(path, ignore_errors=True)
                    self.logger.debug(f"Remove old checkpoint: {path}")

            self.accelerator.wait_for_everyone()
            if run_eval:
                # TODO: run evaluation
                pass

            # Update info for each epoch
            self.epoch += 1

        # Finish training and save final checkpoint
        self.accelerator.wait_for_everyone()
        if self.accelerator.is_main_process:
            self.accelerator.save_state(
                os.path.join(
                    self.checkpoint_dir,
                    "final_epoch-{:04d}_step-{:07d}_loss-{:.6f}".format(
                        self.epoch, self.step, valid_loss
                    ),
                )
            )
            self._save_auxiliary_states()

        self.accelerator.end_training()

    ### Following are methods that can be used directly in child classes ###
    def _train_epoch(self):
        r"""Training epoch. Should return average loss of a batch (sample) over
        one epoch. See ``train_loop`` for usage.
        """
        self.model.train()
        epoch_sum_loss: float = 0.0
        epoch_step: int = 0
        for batch in tqdm(
            self.train_dataloader,
            desc=f"Training Epoch {self.epoch}",
            unit="batch",
            colour="GREEN",
            leave=False,
            dynamic_ncols=True,
            smoothing=0.04,
            disable=not self.accelerator.is_main_process,
        ):
            # Do training step and BP
            with self.accelerator.accumulate(self.model):
                loss = self._train_step(batch)
                self.accelerator.backward(loss)
                self.optimizer.step()
                self.optimizer.zero_grad()
            self.batch_count += 1

            # Update info for each step
            # TODO: step means BP counts or batch counts?
            if self.batch_count % self.cfg.train.gradient_accumulation_step == 0:
                epoch_sum_loss += loss
                self.accelerator.log(
                    {
                        "Step/Train Loss": loss,
                        "Step/Learning Rate": self.optimizer.param_groups[0]["lr"],
                    },
                    step=self.step,
                )
                self.step += 1
                epoch_step += 1

        self.accelerator.wait_for_everyone()
        return (
            epoch_sum_loss
            / len(self.train_dataloader)
            * self.cfg.train.gradient_accumulation_step
        )

    @torch.inference_mode()
    def _valid_epoch(self):
        r"""Testing epoch. Should return average loss of a batch (sample) over
        one epoch. See ``train_loop`` for usage.
        """
        self.model.eval()
        epoch_sum_loss = 0.0
        for batch in tqdm(
            self.valid_dataloader,
            desc=f"Validating Epoch {self.epoch}",
            unit="batch",
            colour="GREEN",
            leave=False,
            dynamic_ncols=True,
            smoothing=0.04,
            disable=not self.accelerator.is_main_process,
        ):
            batch_loss = self._valid_step(batch)
            epoch_sum_loss += batch_loss.item()

        self.accelerator.wait_for_everyone()
        return epoch_sum_loss / len(self.valid_dataloader)

    def _train_step(self, batch):
        r"""Training forward step. Should return average loss of a sample over
        one batch. Provoke ``_forward_step`` is recommended except for special case.
        See ``_train_epoch`` for usage.
        """
        return self._forward_step(batch)

    @torch.inference_mode()
    def _valid_step(self, batch):
        r"""Testing forward step. Should return average loss of a sample over
        one batch. Provoke ``_forward_step`` is recommended except for special case.
        See ``_test_epoch`` for usage.
        """
        return self._forward_step(batch)

    def __load_model(
        self,
        checkpoint_dir: str = None,
        checkpoint_path: str = None,
        resume_type: str = "",
    ):
        r"""Load model from checkpoint. If checkpoint_path is None, it will
        load the latest checkpoint in checkpoint_dir. If checkpoint_path is not
        None, it will load the checkpoint specified by checkpoint_path. **Only use this
        method after** ``accelerator.prepare()``.
        """
        if checkpoint_path is None:
            ls = [str(i) for i in Path(checkpoint_dir).glob("*")]
            ls.sort(key=lambda x: int(x.split("_")[-3].split("-")[-1]), reverse=True)
            checkpoint_path = ls[0]
        self.logger.info("Resume from {}...".format(checkpoint_path))

        if resume_type in ["resume", ""]:
            # Load all the things, including model weights, optimizer, scheduler, and random states.
            self.accelerator.load_state(input_dir=checkpoint_path)

            # set epoch and step
            self.epoch = int(checkpoint_path.split("_")[-3].split("-")[-1]) + 1
            self.step = int(checkpoint_path.split("_")[-2].split("-")[-1]) + 1

        elif resume_type == "finetune":
            # Load only the model weights
            accelerate.load_checkpoint_and_dispatch(
                self.accelerator.unwrap_model(self.model),
                os.path.join(checkpoint_path, "pytorch_model.bin"),
            )
            self.logger.info("Load model weights for finetune...")

        else:
            raise ValueError("Resume_type must be `resume` or `finetune`.")

        return checkpoint_path

    # TODO: LEGACY CODE
    def _build_dataloader(self):
        Dataset, Collator = self._build_dataset()

        # build dataset instance for each dataset and combine them by ConcatDataset
        datasets_list = []
        for dataset in self.cfg.dataset:
            subdataset = Dataset(self.cfg, dataset, is_valid=False)
            datasets_list.append(subdataset)
        train_dataset = ConcatDataset(datasets_list)
        train_collate = Collator(self.cfg)
        _, batch_sampler = build_samplers(train_dataset, self.cfg, self.logger, "train")
        self.logger.debug(f"train batch_sampler: {list(batch_sampler)}")
        self.logger.debug(f"length: {train_dataset.cumulative_sizes}")
        # TODO: use config instead of (sampler, shuffle, drop_last, batch_size)
        train_loader = DataLoader(
            train_dataset,
            collate_fn=train_collate,
            batch_sampler=batch_sampler,
            num_workers=self.cfg.train.dataloader.num_worker,
            pin_memory=self.cfg.train.dataloader.pin_memory,
        )

        # Build valid dataloader
        datasets_list = []
        for dataset in self.cfg.dataset:
            subdataset = Dataset(self.cfg, dataset, is_valid=True)
            datasets_list.append(subdataset)
        valid_dataset = ConcatDataset(datasets_list)
        valid_collate = Collator(self.cfg)
        _, batch_sampler = build_samplers(valid_dataset, self.cfg, self.logger, "valid")
        self.logger.debug(f"valid batch_sampler: {list(batch_sampler)}")
        self.logger.debug(f"length: {valid_dataset.cumulative_sizes}")
        valid_loader = DataLoader(
            valid_dataset,
            collate_fn=valid_collate,
            batch_sampler=batch_sampler,
            num_workers=self.cfg.train.dataloader.num_worker,
            pin_memory=self.cfg.train.dataloader.pin_memory,
        )
        return train_loader, valid_loader

    @staticmethod
    def _set_random_seed(seed):
        r"""Set random seed for all possible random modules."""
        random.seed(seed)
        np.random.seed(seed)
        torch.random.manual_seed(seed)

    def _check_nan(self, loss, y_pred, y_gt):
        if torch.any(torch.isnan(loss)):
            self.logger.fatal("Fatal Error: Training is down since loss has Nan!")
            self.logger.error("loss = {:.6f}".format(loss.item()), in_order=True)
            if torch.any(torch.isnan(y_pred)):
                self.logger.error(
                    f"y_pred has Nan: {torch.any(torch.isnan(y_pred))}", in_order=True
                )
            else:
                self.logger.debug(
                    f"y_pred has Nan: {torch.any(torch.isnan(y_pred))}", in_order=True
                )
            if torch.any(torch.isnan(y_gt)):
                self.logger.error(
                    f"y_gt has Nan: {torch.any(torch.isnan(y_gt))}", in_order=True
                )
            else:
                self.logger.debug(
                    f"y_gt has nan: {torch.any(torch.isnan(y_gt))}", in_order=True
                )
            if torch.any(torch.isnan(y_pred)):
                self.logger.error(f"y_pred: {y_pred}", in_order=True)
            else:
                self.logger.debug(f"y_pred: {y_pred}", in_order=True)
            if torch.any(torch.isnan(y_gt)):
                self.logger.error(f"y_gt: {y_gt}", in_order=True)
            else:
                self.logger.debug(f"y_gt: {y_gt}", in_order=True)

            # TODO: still OK to save tracking?
            self.accelerator.end_training()
            raise RuntimeError("Loss has Nan! See log for more info.")

    ### Protected methods end ###

    ## Following are private methods ##
    ## !!! These are inconvenient for GAN-based model training. It'd be better to move these to svc_trainer.py if needed.
    def __build_optimizer(self):
        r"""Build optimizer for model."""
        # Make case-insensitive matching
        if self.cfg.train.optimizer.lower() == "adadelta":
            optimizer = torch.optim.Adadelta(
                self.model.parameters(), **self.cfg.train.adadelta
            )
            self.logger.info("Using Adadelta optimizer.")
        elif self.cfg.train.optimizer.lower() == "adagrad":
            optimizer = torch.optim.Adagrad(
                self.model.parameters(), **self.cfg.train.adagrad
            )
            self.logger.info("Using Adagrad optimizer.")
        elif self.cfg.train.optimizer.lower() == "adam":
            optimizer = torch.optim.Adam(self.model.parameters(), **self.cfg.train.adam)
            self.logger.info("Using Adam optimizer.")
        elif self.cfg.train.optimizer.lower() == "adamw":
            optimizer = torch.optim.AdamW(
                self.model.parameters(), **self.cfg.train.adamw
            )
        elif self.cfg.train.optimizer.lower() == "sparseadam":
            optimizer = torch.optim.SparseAdam(
                self.model.parameters(), **self.cfg.train.sparseadam
            )
        elif self.cfg.train.optimizer.lower() == "adamax":
            optimizer = torch.optim.Adamax(
                self.model.parameters(), **self.cfg.train.adamax
            )
        elif self.cfg.train.optimizer.lower() == "asgd":
            optimizer = torch.optim.ASGD(self.model.parameters(), **self.cfg.train.asgd)
        elif self.cfg.train.optimizer.lower() == "lbfgs":
            optimizer = torch.optim.LBFGS(
                self.model.parameters(), **self.cfg.train.lbfgs
            )
        elif self.cfg.train.optimizer.lower() == "nadam":
            optimizer = torch.optim.NAdam(
                self.model.parameters(), **self.cfg.train.nadam
            )
        elif self.cfg.train.optimizer.lower() == "radam":
            optimizer = torch.optim.RAdam(
                self.model.parameters(), **self.cfg.train.radam
            )
        elif self.cfg.train.optimizer.lower() == "rmsprop":
            optimizer = torch.optim.RMSprop(
                self.model.parameters(), **self.cfg.train.rmsprop
            )
        elif self.cfg.train.optimizer.lower() == "rprop":
            optimizer = torch.optim.Rprop(
                self.model.parameters(), **self.cfg.train.rprop
            )
        elif self.cfg.train.optimizer.lower() == "sgd":
            optimizer = torch.optim.SGD(self.model.parameters(), **self.cfg.train.sgd)
        else:
            raise NotImplementedError(
                f"Optimizer {self.cfg.train.optimizer} not supported yet!"
            )
        return optimizer

    def __build_scheduler(self):
        r"""Build scheduler for optimizer."""
        # Make case-insensitive matching
        if self.cfg.train.scheduler.lower() == "lambdalr":
            scheduler = torch.optim.lr_scheduler.LambdaLR(
                self.optimizer, **self.cfg.train.lambdalr
            )
        elif self.cfg.train.scheduler.lower() == "multiplicativelr":
            scheduler = torch.optim.lr_scheduler.MultiplicativeLR(
                self.optimizer, **self.cfg.train.multiplicativelr
            )
        elif self.cfg.train.scheduler.lower() == "steplr":
            scheduler = torch.optim.lr_scheduler.StepLR(
                self.optimizer, **self.cfg.train.steplr
            )
        elif self.cfg.train.scheduler.lower() == "multisteplr":
            scheduler = torch.optim.lr_scheduler.MultiStepLR(
                self.optimizer, **self.cfg.train.multisteplr
            )
        elif self.cfg.train.scheduler.lower() == "constantlr":
            scheduler = torch.optim.lr_scheduler.ConstantLR(
                self.optimizer, **self.cfg.train.constantlr
            )
        elif self.cfg.train.scheduler.lower() == "linearlr":
            scheduler = torch.optim.lr_scheduler.LinearLR(
                self.optimizer, **self.cfg.train.linearlr
            )
        elif self.cfg.train.scheduler.lower() == "exponentiallr":
            scheduler = torch.optim.lr_scheduler.ExponentialLR(
                self.optimizer, **self.cfg.train.exponentiallr
            )
        elif self.cfg.train.scheduler.lower() == "polynomiallr":
            scheduler = torch.optim.lr_scheduler.PolynomialLR(
                self.optimizer, **self.cfg.train.polynomiallr
            )
        elif self.cfg.train.scheduler.lower() == "cosineannealinglr":
            scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
                self.optimizer, **self.cfg.train.cosineannealinglr
            )
        elif self.cfg.train.scheduler.lower() == "sequentiallr":
            scheduler = torch.optim.lr_scheduler.SequentialLR(
                self.optimizer, **self.cfg.train.sequentiallr
            )
        elif self.cfg.train.scheduler.lower() == "reducelronplateau":
            scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                self.optimizer, **self.cfg.train.reducelronplateau
            )
        elif self.cfg.train.scheduler.lower() == "cycliclr":
            scheduler = torch.optim.lr_scheduler.CyclicLR(
                self.optimizer, **self.cfg.train.cycliclr
            )
        elif self.cfg.train.scheduler.lower() == "onecyclelr":
            scheduler = torch.optim.lr_scheduler.OneCycleLR(
                self.optimizer, **self.cfg.train.onecyclelr
            )
        elif self.cfg.train.scheduler.lower() == "cosineannearingwarmrestarts":
            scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
                self.optimizer, **self.cfg.train.cosineannearingwarmrestarts
            )
        elif self.cfg.train.scheduler.lower() == "noamlr":
            scheduler = NoamLR(self.optimizer, **self.cfg.train.lr_scheduler)
        else:
            raise NotImplementedError(
                f"Scheduler {self.cfg.train.scheduler} not supported yet!"
            )
        return scheduler

    def _init_accelerator(self):
        self.exp_dir = os.path.join(
            os.path.abspath(self.cfg.log_dir), self.args.exp_name
        )
        project_config = ProjectConfiguration(
            project_dir=self.exp_dir,
            logging_dir=os.path.join(self.exp_dir, "log"),
        )
        self.accelerator = accelerate.Accelerator(
            gradient_accumulation_steps=self.cfg.train.gradient_accumulation_step,
            log_with=self.cfg.train.tracker,
            project_config=project_config,
        )
        if self.accelerator.is_main_process:
            os.makedirs(project_config.project_dir, exist_ok=True)
            os.makedirs(project_config.logging_dir, exist_ok=True)
        with self.accelerator.main_process_first():
            self.accelerator.init_trackers(self.args.exp_name)

    def __check_basic_configs(self):
        if self.cfg.train.gradient_accumulation_step <= 0:
            self.logger.fatal("Invalid gradient_accumulation_step value!")
            self.logger.error(
                f"Invalid gradient_accumulation_step value: {self.cfg.train.gradient_accumulation_step}. It should be positive."
            )
            self.accelerator.end_training()
            raise ValueError(
                f"Invalid gradient_accumulation_step value: {self.cfg.train.gradient_accumulation_step}. It should be positive."
            )
        # TODO: check other values

    @staticmethod
    def __count_parameters(model):
        model_param = 0.0
        if isinstance(model, dict):
            for key, value in model.items():
                model_param += sum(p.numel() for p in model[key].parameters())
        else:
            model_param = sum(p.numel() for p in model.parameters())
        return model_param

    def __dump_cfg(self, path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        json5.dump(
            self.cfg,
            open(path, "w"),
            indent=4,
            sort_keys=True,
            ensure_ascii=False,
            quote_keys=True,
        )

    ### Private methods end ###
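Again for orientation (not part of the diff): the accelerate-based BaseTrainer above drives everything from train_loop(), so a task trainer only needs the five abstract hooks. Below is a minimal, assumption-heavy sketch; SketchDataset, SketchCollator, and the toy linear model with "input"/"target" batch keys are placeholders, not Amphion code.

import torch.nn as nn


class SketchTrainer(BaseTrainer):  # BaseTrainer as defined above
    def _build_dataset(self):
        # Must return the (Dataset, Collator) classes; placeholders here.
        return SketchDataset, SketchCollator

    @staticmethod
    def _build_criterion():
        return nn.L1Loss()

    def _build_model(self):
        return nn.Linear(80, 80)  # placeholder acoustic model

    def _forward_step(self, batch):
        # One scalar loss per batch; backward, logging and checkpointing
        # are handled by _train_epoch()/train_loop().
        pred = self.model(batch["input"])
        return self.criterion(pred, batch["target"])

    def _save_auxiliary_states(self):
        pass


# Typical entry point (see "THIS IS MAIN ENTRY" above):
#     trainer = SketchTrainer(args, cfg)
#     trainer.train_loop()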

models/svc/__init__.py
ADDED
File without changes

models/svc/base/__init__.py
ADDED
@@ -0,0 +1,7 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from .svc_inference import SVCInference
from .svc_trainer import SVCTrainer

models/svc/base/svc_dataset.py
ADDED
@@ -0,0 +1,425 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import random
import torch
from torch.nn.utils.rnn import pad_sequence
import json
import os
import numpy as np
from utils.data_utils import *
from processors.acoustic_extractor import cal_normalized_mel, load_mel_extrema
from processors.content_extractor import (
    ContentvecExtractor,
    WhisperExtractor,
    WenetExtractor,
)
from models.base.base_dataset import (
    BaseCollator,
    BaseDataset,
)
from models.base.new_dataset import BaseTestDataset

EPS = 1.0e-12


class SVCDataset(BaseDataset):
    def __init__(self, cfg, dataset, is_valid=False):
        BaseDataset.__init__(self, cfg, dataset, is_valid=is_valid)

        cfg = self.cfg

        if cfg.model.condition_encoder.use_whisper:
            self.whisper_aligner = WhisperExtractor(self.cfg)
            self.utt2whisper_path = load_content_feature_path(
                self.metadata, cfg.preprocess.processed_dir, cfg.preprocess.whisper_dir
            )

        if cfg.model.condition_encoder.use_contentvec:
            self.contentvec_aligner = ContentvecExtractor(self.cfg)
            self.utt2contentVec_path = load_content_feature_path(
                self.metadata,
                cfg.preprocess.processed_dir,
                cfg.preprocess.contentvec_dir,
            )

        if cfg.model.condition_encoder.use_mert:
            self.utt2mert_path = load_content_feature_path(
                self.metadata, cfg.preprocess.processed_dir, cfg.preprocess.mert_dir
            )
        if cfg.model.condition_encoder.use_wenet:
            self.wenet_aligner = WenetExtractor(self.cfg)
            self.utt2wenet_path = load_content_feature_path(
                self.metadata, cfg.preprocess.processed_dir, cfg.preprocess.wenet_dir
            )

    def __getitem__(self, index):
        single_feature = BaseDataset.__getitem__(self, index)

        utt_info = self.metadata[index]
        dataset = utt_info["Dataset"]
        uid = utt_info["Uid"]
        utt = "{}_{}".format(dataset, uid)

        if self.cfg.model.condition_encoder.use_whisper:
            assert "target_len" in single_feature.keys()
            aligned_whisper_feat = self.whisper_aligner.offline_align(
                np.load(self.utt2whisper_path[utt]), single_feature["target_len"]
            )
            single_feature["whisper_feat"] = aligned_whisper_feat

        if self.cfg.model.condition_encoder.use_contentvec:
            assert "target_len" in single_feature.keys()
            aligned_contentvec = self.contentvec_aligner.offline_align(
                np.load(self.utt2contentVec_path[utt]), single_feature["target_len"]
            )
            single_feature["contentvec_feat"] = aligned_contentvec

        if self.cfg.model.condition_encoder.use_mert:
            assert "target_len" in single_feature.keys()
            aligned_mert_feat = align_content_feature_length(
                np.load(self.utt2mert_path[utt]),
                single_feature["target_len"],
                source_hop=self.cfg.preprocess.mert_hop_size,
            )
            single_feature["mert_feat"] = aligned_mert_feat

        if self.cfg.model.condition_encoder.use_wenet:
            assert "target_len" in single_feature.keys()
            aligned_wenet_feat = self.wenet_aligner.offline_align(
                np.load(self.utt2wenet_path[utt]), single_feature["target_len"]
            )
            single_feature["wenet_feat"] = aligned_wenet_feat

        # print(single_feature.keys())
        # for k, v in single_feature.items():
        #     if type(v) in [torch.Tensor, np.ndarray]:
        #         print(k, v.shape)
        #     else:
        #         print(k, v)
        # exit()

        return self.clip_if_too_long(single_feature)

    def __len__(self):
        return len(self.metadata)

    def random_select(self, feature_seq_len, max_seq_len, ending_ts=2812):
        """
        ending_ts: to avoid invalid whisper features for over 30s audios
            2812 = 30 * 24000 // 256
        """
        ts = max(feature_seq_len - max_seq_len, 0)
        ts = min(ts, ending_ts - max_seq_len)

        start = random.randint(0, ts)
        end = start + max_seq_len
        return start, end

    def clip_if_too_long(self, sample, max_seq_len=512):
        """
        sample :
            {
                'spk_id': (1,),
                'target_len': int
                'mel': (seq_len, dim),
                'frame_pitch': (seq_len,)
                'frame_energy': (seq_len,)
                'content_vector_feat': (seq_len, dim)
            }
        """
        if sample["target_len"] <= max_seq_len:
            return sample

        start, end = self.random_select(sample["target_len"], max_seq_len)
        sample["target_len"] = end - start

        for k in sample.keys():
            if k not in ["spk_id", "target_len"]:
                sample[k] = sample[k][start:end]

        return sample


class SVCCollator(BaseCollator):
    """Zero-pads model inputs and targets based on number of frames per step"""

    def __init__(self, cfg):
        BaseCollator.__init__(self, cfg)

    def __call__(self, batch):
        parsed_batch_features = BaseCollator.__call__(self, batch)
        return parsed_batch_features


class SVCTestDataset(BaseTestDataset):
    def __init__(self, args, cfg, infer_type):
        BaseTestDataset.__init__(self, args, cfg, infer_type)
        self.metadata = self.get_metadata()

        target_singer = args.target_singer
        self.cfg = cfg
        self.trans_key = args.trans_key
        assert type(target_singer) == str

        self.target_singer = target_singer.split("_")[-1]
        self.target_dataset = target_singer.replace(
            "_{}".format(self.target_singer), ""
        )

        self.target_mel_extrema = load_mel_extrema(cfg.preprocess, self.target_dataset)
        self.target_mel_extrema = torch.as_tensor(
            self.target_mel_extrema[0]
        ), torch.as_tensor(self.target_mel_extrema[1])

        ######### Load source acoustic features #########
        if cfg.preprocess.use_spkid:
            spk2id_path = os.path.join(args.acoustics_dir, cfg.preprocess.spk2id)
            # utt2sp_path = os.path.join(self.data_root, cfg.preprocess.utt2spk)

            with open(spk2id_path, "r") as f:
                self.spk2id = json.load(f)
            # print("self.spk2id", self.spk2id)

        if cfg.preprocess.use_uv:
            self.utt2uv_path = {
                f'{utt_info["Dataset"]}_{utt_info["Uid"]}': os.path.join(
                    cfg.preprocess.processed_dir,
                    utt_info["Dataset"],
                    cfg.preprocess.uv_dir,
                    utt_info["Uid"] + ".npy",
                )
                for utt_info in self.metadata
            }

        if cfg.preprocess.use_frame_pitch:
            self.utt2frame_pitch_path = {
                f'{utt_info["Dataset"]}_{utt_info["Uid"]}': os.path.join(
                    cfg.preprocess.processed_dir,
                    utt_info["Dataset"],
                    cfg.preprocess.pitch_dir,
                    utt_info["Uid"] + ".npy",
                )
                for utt_info in self.metadata
            }

            # Target F0 median
            target_f0_statistics_path = os.path.join(
                cfg.preprocess.processed_dir,
                self.target_dataset,
                cfg.preprocess.pitch_dir,
                "statistics.json",
            )
            self.target_pitch_median = json.load(open(target_f0_statistics_path, "r"))[
                f"{self.target_dataset}_{self.target_singer}"
            ]["voiced_positions"]["median"]

            # Source F0 median (if infer from file)
            if infer_type == "from_file":
                source_audio_name = cfg.inference.source_audio_name
                source_f0_statistics_path = os.path.join(
                    cfg.preprocess.processed_dir,
                    source_audio_name,
                    cfg.preprocess.pitch_dir,
                    "statistics.json",
                )
                self.source_pitch_median = json.load(
                    open(source_f0_statistics_path, "r")
                )[f"{source_audio_name}_{source_audio_name}"]["voiced_positions"][
                    "median"
                ]
            else:
                self.source_pitch_median = None

        if cfg.preprocess.use_frame_energy:
            self.utt2frame_energy_path = {
                f'{utt_info["Dataset"]}_{utt_info["Uid"]}': os.path.join(
                    cfg.preprocess.processed_dir,
                    utt_info["Dataset"],
                    cfg.preprocess.energy_dir,
                    utt_info["Uid"] + ".npy",
                )
                for utt_info in self.metadata
            }

        if cfg.preprocess.use_mel:
            self.utt2mel_path = {
                f'{utt_info["Dataset"]}_{utt_info["Uid"]}': os.path.join(
                    cfg.preprocess.processed_dir,
                    utt_info["Dataset"],
                    cfg.preprocess.mel_dir,
                    utt_info["Uid"] + ".npy",
                )
                for utt_info in self.metadata
            }

        ######### Load source content features' path #########
        if cfg.model.condition_encoder.use_whisper:
            self.whisper_aligner = WhisperExtractor(cfg)
            self.utt2whisper_path = load_content_feature_path(
                self.metadata, cfg.preprocess.processed_dir, cfg.preprocess.whisper_dir
            )

        if cfg.model.condition_encoder.use_contentvec:
            self.contentvec_aligner = ContentvecExtractor(cfg)
            self.utt2contentVec_path = load_content_feature_path(
                self.metadata,
                cfg.preprocess.processed_dir,
                cfg.preprocess.contentvec_dir,
            )

        if cfg.model.condition_encoder.use_mert:
            self.utt2mert_path = load_content_feature_path(
                self.metadata, cfg.preprocess.processed_dir, cfg.preprocess.mert_dir
            )
        if cfg.model.condition_encoder.use_wenet:
            self.wenet_aligner = WenetExtractor(cfg)
            self.utt2wenet_path = load_content_feature_path(
                self.metadata, cfg.preprocess.processed_dir, cfg.preprocess.wenet_dir
            )

    def __getitem__(self, index):
        single_feature = {}

        utt_info = self.metadata[index]
        dataset = utt_info["Dataset"]
        uid = utt_info["Uid"]
        utt = "{}_{}".format(dataset, uid)

        source_dataset = self.metadata[index]["Dataset"]

        if self.cfg.preprocess.use_spkid:
            single_feature["spk_id"] = np.array(
                [self.spk2id[f"{self.target_dataset}_{self.target_singer}"]],
                dtype=np.int32,
            )

        ######### Get Acoustic Features Item #########
        if self.cfg.preprocess.use_mel:
            mel = np.load(self.utt2mel_path[utt])
            assert mel.shape[0] == self.cfg.preprocess.n_mel  # [n_mels, T]
            if self.cfg.preprocess.use_min_max_norm_mel:
                # mel norm
                mel = cal_normalized_mel(mel, source_dataset, self.cfg.preprocess)

            if "target_len" not in single_feature.keys():
                single_feature["target_len"] = mel.shape[1]
            single_feature["mel"] = mel.T  # [T, n_mels]

        if self.cfg.preprocess.use_frame_pitch:
            frame_pitch_path = self.utt2frame_pitch_path[utt]
            frame_pitch = np.load(frame_pitch_path)

            if self.trans_key:
                try:
                    self.trans_key = int(self.trans_key)
                except:
                    pass
                if type(self.trans_key) == int:
                    frame_pitch = transpose_key(frame_pitch, self.trans_key)
                elif self.trans_key:
                    assert self.target_singer

                    frame_pitch = pitch_shift_to_target(
                        frame_pitch, self.target_pitch_median, self.source_pitch_median
                    )

            if "target_len" not in single_feature.keys():
                single_feature["target_len"] = len(frame_pitch)
            aligned_frame_pitch = align_length(
                frame_pitch, single_feature["target_len"]
            )
            single_feature["frame_pitch"] = aligned_frame_pitch

            if self.cfg.preprocess.use_uv:
                frame_uv_path = self.utt2uv_path[utt]
                frame_uv = np.load(frame_uv_path)
                aligned_frame_uv = align_length(frame_uv, single_feature["target_len"])
                aligned_frame_uv = [
                    0 if frame_uv else 1 for frame_uv in aligned_frame_uv
                ]
                aligned_frame_uv = np.array(aligned_frame_uv)
                single_feature["frame_uv"] = aligned_frame_uv

        if self.cfg.preprocess.use_frame_energy:
            frame_energy_path = self.utt2frame_energy_path[utt]
            frame_energy = np.load(frame_energy_path)
            if "target_len" not in single_feature.keys():
                single_feature["target_len"] = len(frame_energy)
            aligned_frame_energy = align_length(
                frame_energy, single_feature["target_len"]
            )
            single_feature["frame_energy"] = aligned_frame_energy

        ######### Get Content Features Item #########
        if self.cfg.model.condition_encoder.use_whisper:
            assert "target_len" in single_feature.keys()
            aligned_whisper_feat = self.whisper_aligner.offline_align(
                np.load(self.utt2whisper_path[utt]), single_feature["target_len"]
            )
            single_feature["whisper_feat"] = aligned_whisper_feat

        if self.cfg.model.condition_encoder.use_contentvec:
            assert "target_len" in single_feature.keys()
            aligned_contentvec = self.contentvec_aligner.offline_align(
                np.load(self.utt2contentVec_path[utt]), single_feature["target_len"]
            )
            single_feature["contentvec_feat"] = aligned_contentvec

        if self.cfg.model.condition_encoder.use_mert:
            assert "target_len" in single_feature.keys()
            aligned_mert_feat = align_content_feature_length(
                np.load(self.utt2mert_path[utt]),
                single_feature["target_len"],
                source_hop=self.cfg.preprocess.mert_hop_size,
            )
            single_feature["mert_feat"] = aligned_mert_feat

        if self.cfg.model.condition_encoder.use_wenet:
            assert "target_len" in single_feature.keys()
            aligned_wenet_feat = self.wenet_aligner.offline_align(
                np.load(self.utt2wenet_path[utt]), single_feature["target_len"]
            )
            single_feature["wenet_feat"] = aligned_wenet_feat

        return single_feature

    def __len__(self):
        return len(self.metadata)


class SVCTestCollator:
    """Zero-pads model inputs and targets based on number of frames per step"""

    def __init__(self, cfg):
        self.cfg = cfg

    def __call__(self, batch):
        packed_batch_features = dict()

        # mel: [b, T, n_mels]
        # frame_pitch, frame_energy: [1, T]
        # target_len: [1]
        # spk_id: [b, 1]
        # mask: [b, T, 1]

        for key in batch[0].keys():
            if key == "target_len":
                packed_batch_features["target_len"] = torch.LongTensor(
                    [b["target_len"] for b in batch]
                )
                masks = [
                    torch.ones((b["target_len"], 1), dtype=torch.long) for b in batch
                ]
                packed_batch_features["mask"] = pad_sequence(
                    masks, batch_first=True, padding_value=0
                )
            else:
                values = [torch.from_numpy(b[key]) for b in batch]
                packed_batch_features[key] = pad_sequence(
                    values, batch_first=True, padding_value=0
                )

        return packed_batch_features
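
For orientation, the sketch below shows one way the classes above could be wired into a PyTorch DataLoader for training. It is an illustrative example rather than code from this commit; the dataset name, batch size, and worker count are placeholder assumptions.

# Hypothetical usage sketch (not part of this commit): building a training
# DataLoader from SVCDataset / SVCCollator. `cfg` is assumed to be a loaded
# Amphion experiment config, and "vocalist_l1" a preprocessed dataset name.
from torch.utils.data import DataLoader

from models.svc.base.svc_dataset import SVCCollator, SVCDataset


def build_svc_loader(cfg, dataset_name="vocalist_l1", batch_size=16):
    train_set = SVCDataset(cfg, dataset_name, is_valid=False)
    collator = SVCCollator(cfg)
    return DataLoader(
        train_set,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collator,  # zero-pads variable-length features per batch
        num_workers=4,
    )

Here `SVCCollator` simply delegates to `BaseCollator`, so every variable-length feature in a batch comes out zero-padded to the longest utterance, matching the mask convention used by `SVCTestCollator`.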
models/svc/base/svc_inference.py ADDED
@@ -0,0 +1,15 @@

# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from models.base.new_inference import BaseInference
from models.svc.base.svc_dataset import SVCTestCollator, SVCTestDataset


class SVCInference(BaseInference):
    def __init__(self, args=None, cfg=None, infer_type="from_dataset"):
        BaseInference.__init__(self, args, cfg, infer_type)

    def _build_test_dataset(self):
        return SVCTestDataset, SVCTestCollator
models/svc/base/svc_trainer.py ADDED
@@ -0,0 +1,111 @@

# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import json
import os

import torch
import torch.nn as nn

from models.base.new_trainer import BaseTrainer
from models.svc.base.svc_dataset import SVCCollator, SVCDataset


class SVCTrainer(BaseTrainer):
    r"""The base trainer for all SVC models. It inherits from BaseTrainer and implements
    ``build_criterion``, ``_build_dataset`` and ``_build_singer_lut`` methods. You can inherit from this
    class, and implement ``_build_model``, ``_forward_step``.
    """

    def __init__(self, args=None, cfg=None):
        self.args = args
        self.cfg = cfg

        self._init_accelerator()

        # Only for SVC tasks
        with self.accelerator.main_process_first():
            self.singers = self._build_singer_lut()

        # Super init
        BaseTrainer.__init__(self, args, cfg)

        # Only for SVC tasks
        self.task_type = "SVC"
        self.logger.info("Task type: {}".format(self.task_type))

    ### Following are methods only for SVC tasks ###
    # TODO: LEGACY CODE, NEED TO BE REFACTORED
    def _build_dataset(self):
        return SVCDataset, SVCCollator

    @staticmethod
    def _build_criterion():
        criterion = nn.MSELoss(reduction="none")
        return criterion

    @staticmethod
    def _compute_loss(criterion, y_pred, y_gt, loss_mask):
        """
        Args:
            criterion: MSELoss(reduction='none')
            y_pred, y_gt: (bs, seq_len, D)
            loss_mask: (bs, seq_len, 1)
        Returns:
            loss: Tensor of shape []
        """

        # (bs, seq_len, D)
        loss = criterion(y_pred, y_gt)
        # expand loss_mask to (bs, seq_len, D)
        loss_mask = loss_mask.repeat(1, 1, loss.shape[-1])

        loss = torch.sum(loss * loss_mask) / torch.sum(loss_mask)
        return loss

    def _save_auxiliary_states(self):
        """
        To save the singer's look-up table in the checkpoint saving path
        """
        with open(
            os.path.join(self.tmp_checkpoint_save_path, self.cfg.preprocess.spk2id), "w"
        ) as f:
            json.dump(self.singers, f, indent=4, ensure_ascii=False)

    def _build_singer_lut(self):
        resumed_singer_path = None
        if self.args.resume_from_ckpt_path and self.args.resume_from_ckpt_path != "":
            resumed_singer_path = os.path.join(
                self.args.resume_from_ckpt_path, self.cfg.preprocess.spk2id
            )
        if os.path.exists(os.path.join(self.exp_dir, self.cfg.preprocess.spk2id)):
            resumed_singer_path = os.path.join(self.exp_dir, self.cfg.preprocess.spk2id)

        if resumed_singer_path:
            with open(resumed_singer_path, "r") as f:
                singers = json.load(f)
        else:
            singers = dict()

        for dataset in self.cfg.dataset:
            singer_lut_path = os.path.join(
                self.cfg.preprocess.processed_dir, dataset, self.cfg.preprocess.spk2id
            )
            with open(singer_lut_path, "r") as singer_lut_path:
                singer_lut = json.load(singer_lut_path)
            for singer in singer_lut.keys():
                if singer not in singers:
                    singers[singer] = len(singers)

        with open(
            os.path.join(self.exp_dir, self.cfg.preprocess.spk2id), "w"
        ) as singer_file:
            json.dump(singers, singer_file, indent=4, ensure_ascii=False)
        print(
            "singers have been dumped to {}".format(
                os.path.join(self.exp_dir, self.cfg.preprocess.spk2id)
            )
        )
        return singers
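
As the class docstring notes, a concrete SVC model only needs to provide `_build_model` and `_forward_step`. The sketch below illustrates the minimal shape of such a subclass; the placeholder architecture, the batch keys, the per-step criterion construction, and the assumption that the base trainer exposes the built model as `self.model` are illustrative and not part of this commit.

# Hypothetical subclass sketch (not part of this commit).
import torch.nn as nn

from models.svc.base.svc_trainer import SVCTrainer


class LinearSVCTrainer(SVCTrainer):
    def _build_model(self):
        # Placeholder architecture: any nn.Module mapping condition
        # features to mel frames would go here.
        return nn.Linear(256, 100)

    def _forward_step(self, batch):
        # Batch keys follow the collator output above; "mask" flags the
        # non-padded frames of each utterance.
        y_pred = self.model(batch["contentvec_feat"])  # assumed set by the base trainer
        y_gt = batch["mel"]
        criterion = self._build_criterion()  # in practice built once, not per step
        return self._compute_loss(criterion, y_pred, y_gt, loss_mask=batch["mask"])

Because `_compute_loss` averages the element-wise MSE only over masked (real) frames, padded regions contribute nothing to the gradient.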
models/svc/comosvc/__init__.py ADDED
@@ -0,0 +1,4 @@

# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.