# Text-to-Audio with Latent Diffusion Model
This is the quicktour for training a text-to-audio model with the popular and powerful generative model: [Latent Diffusion Model](https://arxiv.org/abs/2112.10752). Specifically, this recipe is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper "[AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models](https://arxiv.org/abs/2304.00830)". You can check the last part of the [AUDIT demos](https://audit-demo.github.io/) to see some text-to-audio examples.
<br>
<div align="center">
<img src="../../imgs/tta/DiffusionTTA.png" width="65%">
</div>
<br>
We train this latent diffusion model in two stages:
1. In the first stage, we aim to obtain a high-quality VAE (called `AutoencoderKL` in Amphion), so that we can project the input mel-spectrograms into an efficient, low-dimensional latent space. In particular, we train the VAE with a GAN loss to improve the reconstruction quality.
2. In the second stage, we aim to obtain a text-controllable diffusion model (called `AudioLDM` in Amphion). We use a U-Net-based diffusion model and a T5 encoder as the text encoder.
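For intuition, here is a minimal sketch (not Amphion's actual classes) of how the two stages compose at inference time: a T5 text embedding conditions a U-Net that denoises in the VAE's latent space, and the VAE decoder maps the final latent back to a mel-spectrogram. The `t5-base` checkpoint, the latent shape, and the `unet` / `autoencoder_kl` / `vocoder` calls are all illustrative placeholders.

```python
# Conceptual sketch only: text -> T5 embedding -> latent diffusion -> VAE decode.
# The commented-out modules stand in for the ones trained in this recipe.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")      # checkpoint name is an assumption
text_encoder = T5EncoderModel.from_pretrained("t5-base")

prompt = "A man is whistling"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(**tokens).last_hidden_state   # (1, seq_len, d_model)

# Start from Gaussian noise in the latent space learned by the first-stage VAE.
latents = torch.randn(1, 4, 20, 78)                       # illustrative latent shape

# for t in scheduler.timesteps:                           # iterative denoising
#     noise_pred = unet(latents, t, context=text_emb)     # text enters via cross-attention
#     latents = scheduler.step(noise_pred, t, latents)
# mel = autoencoder_kl.decode(latents)                    # latent -> mel-spectrogram
# waveform = vocoder(mel)                                 # mel -> waveform
```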
There are four stages in total:
1. Data preparation and processing
2. Train the VAE model
3. Train the latent diffusion model
4. Inference
> **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
> ```bash
> cd Amphion
> ```
## Overview

```sh
# Train the VAE model
sh egs/tta/autoencoderkl/run_train.sh

# Train the latent diffusion model
sh egs/tta/audioldm/run_train.sh

# Inference
sh egs/tta/audioldm/run_inference.sh
```
## 1. Data Preparation and Processing

### Dataset Download
We take [AudioCaps](https://audiocaps.github.io/) as an example. AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip corresponds to a caption with rich semantic information. You can download the dataset [here](https://github.com/cdjkim/audiocaps).
<!-- How to download AudioCaps is detailed [here](../datasets/README.md) -->
<!-- You can download the dataset [here](https://github.com/cdjkim/audiocaps). -->
### Data Processing

- Download the AudioCaps dataset to `[Your path to save tta dataset]` and set `preprocess.processed_dir` accordingly in `egs/tta/.../exp_config.json`:
```json
{
    "dataset": [
        "AudioCaps"
    ],
    "preprocess": {
        // Specify the output root path to save the processed data
        "processed_dir": "[Your path to save tta dataset]",
        ...
    }
}
```
The folder structure of your downloaded data should be similar to:
```plaintext
.../[Your path to save tta dataset]
┣ AudioCaps
┃ ┣ wav
┃ ┃ ┣ ---1_cCGK4M_0_10000.wav
┃ ┃ ┣ ---lTs1dxhU_30000_40000.wav
┃ ┃ ┣ ...
```
- Then process the data into mel-spectrograms and save them in `.npy` format (a sketch of one way to do this is given after this list). If you use the data we provide, all the wav data has already been processed.
- Generate a json file to save the metadata; the json file looks like this:
```json
[
    {
        "Dataset": "AudioCaps",
        "Uid": "---1_cCGK4M_0_10000",
        "Caption": "Idling car, train blows horn and passes"
    },
    {
        "Dataset": "AudioCaps",
        "Uid": "---lTs1dxhU_30000_40000",
        "Caption": "A racing vehicle engine is heard passing by"
    },
    ...
]
```
- Finally, the folder structure looks like:
```plaintext
.../[Your path to save tta dataset]
┣ AudioCaps
┃ ┣ wav
┃ ┃ ┣ ---1_cCGK4M_0_10000.wav
┃ ┃ ┣ ---lTs1dxhU_30000_40000.wav
┃ ┃ ┣ ...
┃ ┣ mel
┃ ┃ ┣ ---1_cCGK4M_0_10000.npy
┃ ┃ ┣ ---lTs1dxhU_30000_40000.npy
┃ ┃ ┣ ...
┃ ┣ train.json
┃ ┣ valid.json
┃ ┣ ...
```
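If you prepare this layout yourself, the sketch below shows one way to compute the mel-spectrograms and write the metadata json with `librosa` and `numpy`. The mel parameters (sampling rate, FFT size, hop length, number of mel bins) and the caption placeholder are assumptions for illustration; use the values from `egs/tta/.../exp_config.json` so they match what the VAE is trained on.

```python
# Hedged sketch for preparing AudioCaps-style data; mel parameters are
# illustrative and should be taken from the recipe's exp_config.json.
import json
import os

import librosa
import numpy as np

root = "[Your path to save tta dataset]/AudioCaps"
os.makedirs(os.path.join(root, "mel"), exist_ok=True)

metadata = []
for fname in sorted(os.listdir(os.path.join(root, "wav"))):
    if not fname.endswith(".wav"):
        continue
    uid = fname[: -len(".wav")]

    # Load the audio and compute a log-mel-spectrogram (parameters are assumptions).
    y, sr = librosa.load(os.path.join(root, "wav", fname), sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
    np.save(os.path.join(root, "mel", f"{uid}.npy"), log_mel)

    # Captions come from the AudioCaps csv files; a placeholder is used here.
    metadata.append({"Dataset": "AudioCaps", "Uid": uid, "Caption": "..."})

with open(os.path.join(root, "train.json"), "w") as f:
    json.dump(metadata, f, indent=4)
```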
## 2. Training the VAE Model
The first-stage model is a VAE trained with a GAN loss (called `AutoencoderKL` in Amphion). Run the following command:
```sh
sh egs/tta/autoencoderkl/run_train.sh
```
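For reference, here is a stripped-down sketch of the first-stage objective: a reconstruction term, a weighted KL term on the latent posterior, and an adversarial term from a discriminator on the reconstructed mel-spectrograms. The encoder, decoder, discriminator, and loss weights below are placeholders, not Amphion's exact modules or settings.

```python
# Illustrative first-stage loss (reconstruction + KL + GAN). All modules are
# stand-ins passed in as callables; the weights are assumptions.
import torch
import torch.nn.functional as F

def vae_gan_loss(mel, encoder, decoder, discriminator,
                 kl_weight=1e-6, adv_weight=0.5):
    mean, logvar = encoder(mel)                       # posterior parameters
    z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
    recon = decoder(z)                                # latent -> mel

    rec_loss = F.l1_loss(recon, mel)                  # reconstruction term
    kl_loss = 0.5 * torch.mean(mean ** 2 + logvar.exp() - 1.0 - logvar)
    adv_loss = -discriminator(recon).mean()           # generator-side GAN term

    return rec_loss + kl_weight * kl_loss + adv_weight * adv_loss
```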
## 3. Training the Latent Diffusion Model
The second-stage model is a conditional diffusion model with a T5 text encoder (called `AudioLDM` in Amphion). Run the following command:
```sh
sh egs/tta/audioldm/run_train.sh
```
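The second-stage objective is standard noise prediction in the latent space: add Gaussian noise to the VAE latent of a mel-spectrogram at a random timestep, then train the U-Net to predict that noise given the timestep and the T5 text embedding. The sketch below uses a simple linear-beta schedule and a placeholder U-Net interface as assumptions.

```python
# Schematic epsilon-prediction loss for the latent diffusion stage; the U-Net
# interface and the linear-beta schedule are assumptions, not Amphion's code.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # assumed noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(unet, latent, text_emb):
    b = latent.shape[0]
    t = torch.randint(0, T, (b,), device=latent.device)
    noise = torch.randn_like(latent)

    # Forward process: diffuse the clean latent to timestep t.
    a_bar = alphas_cumprod.to(latent.device)[t].view(b, 1, 1, 1)
    noisy = a_bar.sqrt() * latent + (1.0 - a_bar).sqrt() * noise

    # The U-Net predicts the added noise, conditioned on text via cross-attention.
    noise_pred = unet(noisy, t, context=text_emb)
    return F.mse_loss(noise_pred, noise)
```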
## 4. Inference
Now you can generate audio with your pretrained latent diffusion model. Run the following command and modify the `text` argument as needed:
```sh
sh egs/tta/audioldm/run_inference.sh \
--text "A man is whistling"
```
## Citations
```bibtex
@article{wang2023audit,
  title={AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models},
  author={Wang, Yuancheng and Ju, Zeqian and Tan, Xu and He, Lei and Wu, Zhizheng and Bian, Jiang and Zhao, Sheng},
  journal={NeurIPS 2023},
  year={2023}
}

@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023}
}
```