---
tags:
- music
- musicgen
- clap
license: cc-by-nc-4.0
pipeline_tag: text-to-audio
---
(The content below is from https://github.com/yuhuacheng/clap_musicgen)
# CLAP-MusicGen
CLAP-MusicGen is a contrastive audio-text embedding model that combines the strengths of Contrastive Language-Audio Pretraining (CLAP) with Meta's [MusicGen](https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md) as the audio encoder. Users can generate latent embeddings for any given audio or text, enabling downstream tasks like music similarity search and audio classification.
**Note:** this is a **proof-of-concept** personal project; it is meant to demonstrate the idea rather than to provide the highest-quality embeddings.
## Table of Contents
- [Overview](#overview)
- [Model Architecture](#model-architecture)
- [Training Data](#training-data)
- [Quick Start](#quick-start)
- [Similarity Search Demo](#similarity-search-demo)
- [Training / Evaluation Deep Dives](#training--evaluation-deep-dives)
- [License](#license)
- [Citation](#citation)
## Overview
CLAP-MusicGen is a multimodal model designed to enhance music retrieval capabilities. By embedding both audio and text into a shared space, it enables efficient music-to-music and text-to-music search. Unlike traditional models limited to predefined categories, CLAP-MusicGen supports zero-shot classification, retrieval, and embedding extraction, making it a valuable tool for exploring and organizing music collections.
### Key Capabilities:
- **MusicGen-based Audio Encoding:** Uses **MusicGen** to extract high-quality audio embeddings.
- **Two-way Retrieval:** Supports searching for audio given an input audio or text.
## Model Architecture
CLAP-MusicGen consists of:
1. **Audio Encoder:** Uses **MusicGen's decoder** for feature extraction, given tokenized inputs from **EnCodec**.
2. **Text Encoder:** A pretrained RoBERTa model **fine-tuned on music style/genre text** with a masked language modeling (MLM) objective.
3. **Projection Head:** A multi-layer perceptron (MLP) that projects both text and audio embeddings into the same space.
4. **Contrastive(ish) Learning:** Trained with a **listwise ranking loss** instead of a traditional contrastive loss to align text and audio embeddings, improving retrieval performance for tasks like music similarity search (see the sketch after this list).
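As an illustration, a minimal ListNet-style listwise loss over a batch of paired (text, audio) embeddings might look like the sketch below. This is only one common formulation; the exact loss used for training may differ in details such as temperature, normalization, or how relevance is defined.

```python
import torch
import torch.nn.functional as F

def listwise_ranking_loss(text_emb: torch.Tensor, audio_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """ListNet-style listwise loss over a batch of paired (text, audio) embeddings.

    Each caption scores every audio clip in the batch; the matching clip is treated
    as the single relevant item, so the listwise softmax loss reduces to a
    cross-entropy over each row of the similarity matrix. (Illustrative only;
    the training code may use a different listwise formulation.)
    """
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    scores = text_emb @ audio_emb.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```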
## Training Data
The model is trained on the [nyuuzyou/suno](https://huggingface.co/datasets/nyuuzyou/suno) dataset from Hugging Face. This dataset includes approximately **10K** curated audio-caption pairs, split into 80% training, 10% validation, and 10% evaluation. Captions are derived from the `metadata.tags` field, which describes musical styles and genres. Note that one can also include the full prompt from `metadata.prompt` along with the style tags during training, to obtain even richer audio/text embeddings supervised by the full captions.
*Note: since our CLAP model is trained on AI-generated music-caption pairs from Suno, it forms a synthetic data loop in which one AI learns from another AI's outputs. This carries the potential biases of training on AI-generated data and leaves room for further refinement by incorporating human-annotated music datasets.*
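For reference, a rough sketch of loading the dataset and reproducing the 80/10/10 split described above. The exact column layout of `nyuuzyou/suno` (e.g. where the tags live) is an assumption here and should be checked against the dataset card.

```python
from datasets import load_dataset

ds = load_dataset("nyuuzyou/suno", split="train")

# 80% train, 10% validation, 10% evaluation
splits = ds.train_test_split(test_size=0.2, seed=42)
holdout = splits["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, eval_ds = splits["train"], holdout["train"], holdout["test"]

# Captions come from the style/genre tags, e.g. example["metadata"]["tags"]
# (the field access is an assumption; check the dataset card for the exact schema)
```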
## Quick Start
### Installation
To install the necessary dependencies, run:
```bash
pip install torch torchvision torchaudio transformers
```
### Loading the Model from Hugging Face
First, clone the project repository and navigate to the project directory:
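For example:

```bash
git clone https://github.com/yuhuacheng/clap_musicgen.git
cd clap_musicgen
```

Then load the pretrained model and tokenizer: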
```python
from src.modules.clap_model import CLAPModel
from transformers import RobertaTokenizer

# Pretrained CLAP-MusicGen model and the fine-tuned RoBERTa tokenizer from the Hugging Face Hub
model = CLAPModel.from_pretrained("yuhuacheng/clap-musicgen")
tokenizer = RobertaTokenizer.from_pretrained("yuhuacheng/clap-roberta-finetuned")
```
### Extracting Embeddings
#### From Audio
```python
import torch

with torch.no_grad():
    waveform = torch.rand(1, 1, 32000)  # 1 sec waveform at 32 kHz sample rate
    audio_embeddings = model.audio_encoder(ids=None, waveform=waveform)

print(audio_embeddings.shape)  # (1, 1024)
```
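To embed a real audio file instead of random noise, one option is to load and resample it with `torchaudio` (a sketch assuming mono 32 kHz input as in the example above; `my_song.wav` is a placeholder path):

```python
import torchaudio

waveform, sr = torchaudio.load("my_song.wav")                   # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 32000)  # resample to 32 kHz
waveform = waveform.mean(dim=0, keepdim=True).unsqueeze(0)      # downmix to mono, add batch dim -> (1, 1, samples)

with torch.no_grad():
    audio_embeddings = model.audio_encoder(ids=None, waveform=waveform)
```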
#### From Text
```python
sample_captions = [
    'positive jazzy lofi',
    'fast house edm',
    'gangsta rap',
    'dark metal'
]

with torch.no_grad():
    tokenized_captions = tokenizer(sample_captions, return_tensors="pt", padding=True, truncation=True)
    text_embeddings = model.text_encoder(ids=None, **tokenized_captions)

print(text_embeddings.shape)  # (4, 1024)
```
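Once both modalities are embedded in the shared space, retrieval reduces to a nearest-neighbor search over embeddings. A minimal sketch using cosine similarity, assuming `audio_embeddings` and `text_embeddings` from the snippets above (the demo notebook may compute scores differently):

```python
import torch.nn.functional as F

# Cosine similarity between each caption and each audio clip
scores = F.normalize(text_embeddings, dim=-1) @ F.normalize(audio_embeddings, dim=-1).T  # (4, 1)

# Rank captions by how well they describe the (single) audio clip
best = scores[:, 0].argsort(descending=True)
for idx in best.tolist():
    print(f"{scores[idx, 0].item():.3f}  {sample_captions[idx]}")
```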
## Similarity Search Demo
Please refer to the [demo](demo.ipynb) notebook, which demonstrates both **audio-to-audio** and **text-to-audio** search.
*(Result snapshots)*
**Audio-to-Audio Search**

**Text-to-Audio Search**

## Training / Evaluation Deep Dives
(Coming soon)
## License
* The code in this repository is released under the MIT license as found in the [LICENSE file](LICENSE).
* Because the model was trained from the pretrained MusicGen weights, the model weights in this repository are released under the CC-BY-NC 4.0 license as found in the [LICENSE_weights file](LICENSE_weights).
## Citation
```
@inproceedings{copet2023simple,
  title={Simple and Controllable Music Generation},
  author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre D{\'e}fossez},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
}
```
```
@inproceedings{laionclap2023,
  title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author={Wu*, Yusong and Chen*, Ke and Zhang*, Tianyu and Hui*, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
  year={2023}
}
```
```
@inproceedings{htsatke2022,
  author={Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  title={HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
  year={2022}
}
```