File size: 5,003 Bytes
29792f4 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 |
# MultiBand Diffusion
AudioCraft provides the code and models for MultiBand Diffusion, [From Discrete Tokens to High Fidelity Audio using MultiBand Diffusion][arxiv].
MultiBand diffusion is a collection of 4 models that can decode tokens from
<a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> into waveform audio.
<a target="_blank" href="https://colab.research.google.com/drive/1JlTOjB-G0A2Hz3h8PK63vLZk4xdCI5QB?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<br>
## Installation
Please follow the AudioCraft installation instructions from the [README](../README.md).
## Usage
We offer a number of way to use MultiBand Diffusion:
1. The MusicGen demo includes a toggle to try diffusion decoder. You can use the demo locally by running [`python -m demos.musicgen_app --share`](../demos/musicgen_app.py), or through the [MusicGen Colab](https://colab.research.google.com/drive/1JlTOjB-G0A2Hz3h8PK63vLZk4xdCI5QB?usp=sharing).
2. You can play with MusicGen by running the jupyter notebook at [`demos/musicgen_demo.ipynb`](../demos/musicgen_demo.ipynb) locally (if you have a GPU).
## API
We provide a simple API and pre-trained models for MusicGen and for EnCodec at 24 khz for 3 bitrates (1.5 kbps, 3 kbps and 6 kbps).
See after a quick example for using MultiBandDiffusion with the MusicGen API:
```python
import torchaudio
from audiocraft.models import MusicGen, MultiBandDiffusion
from audiocraft.data.audio import audio_write
model = MusicGen.get_pretrained('facebook/musicgen-melody')
mbd = MultiBandDiffusion.get_mbd_musicgen()
model.set_generation_params(duration=8) # generate 8 seconds.
wav, tokens = model.generate_unconditional(4, return_tokens=True) # generates 4 unconditional audio samples and keep the tokens for MBD generation
descriptions = ['happy rock', 'energetic EDM', 'sad jazz']
wav_diffusion = mbd.tokens_to_wav(tokens)
wav, tokens = model.generate(descriptions, return_tokens=True) # generates 3 samples and keep the tokens.
wav_diffusion = mbd.tokens_to_wav(tokens)
melody, sr = torchaudio.load('./assets/bach.mp3')
# Generates using the melody from the given audio and the provided descriptions, returns audio and audio tokens.
wav, tokens = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr, return_tokens=True)
wav_diffusion = mbd.tokens_to_wav(tokens)
for idx, one_wav in enumerate(wav):
# Will save under {idx}.wav and {idx}_diffusion.wav, with loudness normalization at -14 db LUFS for comparing the methods.
audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
audio_write(f'{idx}_diffusion', wav_diffusion[idx].cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
For the compression task (and to compare with [EnCodec](https://github.com/facebookresearch/encodec)):
```python
import torch
from audiocraft.models import MultiBandDiffusion
from encodec import EncodecModel
from audiocraft.data.audio import audio_read, audio_write
bandwidth = 3.0 # 1.5, 3.0, 6.0
mbd = MultiBandDiffusion.get_mbd_24khz(bw=bandwidth)
encodec = EncodecModel.get_encodec_24khz()
somepath = ''
wav, sr = audio_read(somepath)
with torch.no_grad():
compressed_encodec = encodec(wav)
compressed_diffusion = mbd.regenerate(wav, sample_rate=sr)
audio_write('sample_encodec', compressed_encodec.squeeze(0).cpu(), mbd.sample_rate, strategy="loudness", loudness_compressor=True)
audio_write('sample_diffusion', compressed_diffusion.squeeze(0).cpu(), mbd.sample_rate, strategy="loudness", loudness_compressor=True)
```
## Training
The [DiffusionSolver](../audiocraft/solvers/diffusion.py) implements our diffusion training pipeline.
It generates waveform audio conditioned on the embeddings extracted from a pre-trained EnCodec model
(see [EnCodec documentation](./ENCODEC.md) for more details on how to train such model).
Note that **we do NOT provide any of the datasets** used for training our diffusion models.
We provide a dummy dataset containing just a few examples for illustrative purposes.
### Example configurations and grids
One can train diffusion models as described in the paper by using this [dora grid](../audiocraft/grids/diffusion/4_bands_base_32khz.py).
```shell
# 4 bands MBD trainning
dora grid diffusion.4_bands_base_32khz
```
### Learn more
Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).
## Citation
```
@article{sanroman2023fromdi,
title={From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion},
author={San Roman, Robin and Adi, Yossi and Deleforge, Antoine and Serizel, Romain and Synnaeve, Gabriel and Défossez, Alexandre},
journal={arXiv preprint arXiv:},
year={2023}
}
```
## License
See license information in the [README](../README.md).
[arxiv]: https://dl.fbaipublicfiles.com/encodec/Diffusion/paper.pdf
[mbd_samples]: https://ai.honu.io/papers/mbd/
|