|
# AudioGen: Textually-guided audio generation |
|
|
|
AudioCraft provides the code and a model re-implementing AudioGen, a [textually-guided audio generation][audiogen_arxiv]
model that performs text-to-sound generation.
|
|
|
The provided AudioGen reimplementation follows the LM model architecture introduced in [MusicGen][musicgen_arxiv]
and is a single-stage auto-regressive Transformer model trained over a 16kHz
<a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> with 4 codebooks sampled at 50 Hz.
This model variant reaches audio quality comparable to the original implementation introduced in the AudioGen publication,
while generating faster thanks to the lower frame rate.
|
|
|
**Important note:** The provided models are NOT the original models used to report numbers in the
[AudioGen publication][audiogen_arxiv]. Refer to the model card to learn more about architectural changes.
|
|
|
Listen to samples from the **original AudioGen implementation** in our [sample page][audiogen_samples]. |
|
|
|
|
|
## Model Card |
|
|
|
See [the model card](../model_cards/AUDIOGEN_MODEL_CARD.md). |
|
|
|
|
|
## Installation |
|
|
|
Please follow the AudioCraft installation instructions from the [README](../README.md). |
|
|
|
AudioCraft requires a GPU with at least 16 GB of memory for running inference with the medium-sized models (~1.5B parameters). |
|
|
|
## API and usage |
|
|
|
We provide a simple API and a single pre-trained model for AudioGen:
|
|
|
- `facebook/audiogen-medium`: 1.5B model, text to sound - [🤗 Hub](https://huggingface.co/facebook/audiogen-medium)
|
|
|
You can play with AudioGen by running the jupyter notebook at [`demos/audiogen_demo.ipynb`](../demos/audiogen_demo.ipynb) locally (if you have a GPU). |
|
|
|
Below is a quick example of using the API.
|
|
|
```python
import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate 5 seconds.
descriptions = ['dog barking', 'siren of an emergency vehicle', 'footsteps in a corridor']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
|
|
|
## Training |
|
|
|
The [AudioGenSolver](../audiocraft/solvers/audiogen.py) implements AudioGen's training pipeline
used to develop the released model. Note that this may not fully reproduce the results presented in the paper.
Similarly to MusicGen, it defines an autoregressive language modeling task over multiple streams of
discrete tokens extracted from a pre-trained EnCodec model (see the [EnCodec documentation](./ENCODEC.md)
for more details on how to train such a model), with dataset-specific changes for environmental sound
processing.
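To give some intuition for what "multiple streams of discrete tokens" means in practice, here is a toy sketch (not AudioCraft's actual implementation) of the delay interleaving pattern introduced in MusicGen: codebook `k` is shifted right by `k` steps so that a single Transformer step can predict one token per codebook in parallel. The function name and padding value are ours, chosen for illustration.

```python
def delay_interleave(codebook_streams, pad=-1):
    """Shift codebook stream k right by k steps, padding with `pad`.

    Given K streams of length T (one per codebook), returns K aligned
    streams of length T + K - 1 that a single autoregressive model can
    predict jointly, one step per 20 ms frame at 50 Hz.
    """
    num_codebooks = len(codebook_streams)
    out = []
    for k, stream in enumerate(codebook_streams):
        out.append([pad] * k + list(stream) + [pad] * (num_codebooks - 1 - k))
    return out


# 4 codebooks (as in the released AudioGen model), 3 time steps each.
streams = [
    [11, 12, 13],
    [21, 22, 23],
    [31, 32, 33],
    [41, 42, 43],
]
pattern = delay_interleave(streams)
# pattern[0] == [11, 12, 13, -1, -1, -1]
# pattern[3] == [-1, -1, -1, 41, 42, 43]
```

At generation time, the model walks this pattern left to right, so each 20 ms frame costs one Transformer step regardless of the number of codebooks.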
|
|
|
Note that **we do NOT provide any of the datasets** used for training AudioGen. |
|
|
|
### Example configurations and grids |
|
|
|
We provide configurations to reproduce the released models and our research.
AudioGen solver configurations are available in [config/solver/audiogen](../config/solver/audiogen).
The base training configuration used for the released models is
[`solver=audiogen/audiogen_base_16khz`](../config/solver/audiogen/audiogen_base_16khz.yaml).
|
|
|
Example grids to train AudioGen can be found in
[audiocraft/grids/audiogen](../audiocraft/grids/audiogen/).
|
|
|
```shell
# text-to-sound
dora grid audiogen.audiogen_base_16khz
```
|
|
|
### Sound dataset and metadata |
|
|
|
AudioGen's underlying dataset is an AudioDataset augmented with description metadata.
The AudioGen dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files, or through a specified external folder.
Learn more in the [datasets section](./DATASETS.md).
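As an illustration of the expected layout, the sketch below writes a metadata file next to a hypothetical audio file. The file names are made up for the example, and the `description` field is the text used for conditioning; check the datasets documentation for the exact schema expected by your AudioCraft version.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as root:
    # Hypothetical audio file inside the dataset tree.
    audio_path = Path(root) / "train" / "dog_bark_001.wav"
    audio_path.parent.mkdir(parents=True, exist_ok=True)

    # The metadata file lives next to the audio file, with the same stem.
    # NOTE: "description" carries the text prompt; other keys (if any)
    # depend on the schema of your AudioCraft version.
    meta = {"description": "a dog barking in a park"}
    meta_path = audio_path.with_suffix(".json")
    meta_path.write_text(json.dumps(meta))

    loaded = json.loads(meta_path.read_text())
```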
|
|
|
### Evaluation stage |
|
|
|
By default, the evaluation stage only computes the cross-entropy and the perplexity over the
evaluation dataset, as the objective metrics used for evaluation can be costly to run
or require some extra dependencies. Please refer to the [metrics documentation](./METRICS.md)
for more details on the requirements for each metric.
|
|
|
We provide an off-the-shelf configuration to enable running the objective metrics
for audio generation in
[config/solver/audiogen/evaluation/objective_eval](../config/solver/audiogen/evaluation/objective_eval.yaml).
|
|
|
Evaluation can then be activated as follows:
```shell
# using the configuration
dora run solver=audiogen/debug solver/audiogen/evaluation=objective_eval
# specifying each of the fields, e.g. to activate KL computation
dora run solver=audiogen/debug evaluate.metrics.kld=true
```
|
|
|
See [an example evaluation grid](../audiocraft/grids/audiogen/audiogen_pretrained_16khz_eval.py). |
|
|
|
### Generation stage |
|
|
|
The generation stage allows generating samples conditionally and/or unconditionally, as well as
performing audio continuation (from a prompt). We currently support greedy sampling (argmax), sampling
from the softmax with a given temperature, and top-K and top-P (nucleus) sampling. The number of samples
generated and the batch size used are controlled by the `dataset.generate` configuration,
while the other generation parameters are defined in `generate.lm`.
|
|
|
```shell
# control sampling parameters
dora run solver=audiogen/debug generate.lm.gen_duration=5 generate.lm.use_sampling=true generate.lm.top_k=15
```
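To illustrate the sampling strategies listed above, here is a toy, pure-Python sketch of top-K sampling; AudioCraft's real implementation operates on PyTorch tensors inside the LM, so the function below is only an illustration of the idea. With `k=1` it reduces to greedy (argmax) decoding.

```python
import math
import random


def sample_top_k(logits, k, temperature=1.0, rng=None):
    """Toy top-K sampling over a plain list of logits.

    Keeps the k largest logits, applies the temperature, renormalizes,
    and samples an index from the resulting distribution.
    """
    rng = rng or random.Random(0)
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in zip(top, weights):
        acc += w
        if r <= acc:
            return i
    return top[-1]


logits = [0.1, 2.0, -1.0, 3.0]
greedy = sample_top_k(logits, k=1)   # always the argmax: index 3
sampled = sample_top_k(logits, k=2)  # one of the two best: index 1 or 3
```

Top-P (nucleus) sampling works the same way, except that the kept set is the smallest prefix of the sorted distribution whose cumulative probability exceeds P, rather than a fixed-size top-K set.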
|
|
|
## More information |
|
|
|
Refer to [MusicGen's instructions](./MUSICGEN.md). |
|
|
|
### Learn more |
|
|
|
Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md). |
|
|
|
|
|
## Citation |
|
|
|
AudioGen |
|
```
@article{kreuk2022audiogen,
    title={Audiogen: Textually guided audio generation},
    author={Kreuk, Felix and Synnaeve, Gabriel and Polyak, Adam and Singer, Uriel and D{\'e}fossez, Alexandre and Copet, Jade and Parikh, Devi and Taigman, Yaniv and Adi, Yossi},
    journal={arXiv preprint arXiv:2209.15352},
    year={2022}
}
```
|
|
|
MusicGen |
|
```
@article{copet2023simple,
    title={Simple and Controllable Music Generation},
    author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
    journal={arXiv preprint arXiv:2306.05284},
    year={2023},
}
```
|
|
|
## License |
|
|
|
See license information in the [model card](../model_cards/AUDIOGEN_MODEL_CARD.md). |
|
|
|
[audiogen_arxiv]: https://arxiv.org/abs/2209.15352 |
|
[musicgen_arxiv]: https://arxiv.org/abs/2306.05284 |
|
[audiogen_samples]: https://felixkreuk.github.io/audiogen/ |
|
|