---
tags:
- music
- musicgen
- clap
license: cc-by-nc-4.0
pipeline_tag: text-to-audio
---
(Below is from https://github.com/yuhuacheng/clap_musicgen)

# ๐Ÿ‘๐Ÿป CLAP-MusicGen ๐ŸŽต

CLAP-MusicGen is a contrastive audio-text embedding model that combines the strengths of Contrastive Language-Audio Pretraining (CLAP) with Meta's [MusicGen](https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md) as the audio encoder. Users can generate latent embeddings for any given audio or text, enabling downstream tasks like music similarity search and audio classification.

**Note** that this is a **proof-of-concept** personal project: it aims to demonstrate the idea rather than to provide the highest-quality embeddings.

## Table of Contents
- [๐Ÿ‘จโ€๐Ÿซ Overview](#-overview)
- [๐Ÿ—๏ธ Model Architecture](#-model-architecture)
- [๐Ÿ“€ Training Data](#-training-data)
- [๐Ÿ’ป Quick Start](#-quick-start)
- [๐ŸŽง Similarity Search Demo](#-similarity-search-demo)
- [๐Ÿคฟ Training / Evaluation Deep Dives](#-training--evaluation-deep-dives)
- [๐Ÿชช License](#-license)
- [๐Ÿ–‡๏ธ Citation](#-citation)

## ๐Ÿ‘จโ€๐Ÿซ Overview
CLAP-MusicGen is a multimodal model designed to enhance music retrieval capabilities. By embedding both audio and text into a shared space, it enables efficient music-to-music and text-to-music search. Unlike traditional models limited to predefined categories, CLAP-MusicGen supports zero-shot classification, retrieval, and embedding extraction, making it a valuable tool for exploring and organizing music collections.
### Key Capabilities:
- **MusicGen-based Audio Encoding:** Uses **MusicGen** to extract high-quality audio embeddings.
- **Two-way Retrieval:** Supports searching for audio given an input audio or text.

## ๐Ÿ—๏ธ Model Architecture
CLAP-MusicGen consists of:
1. **Audio Encoder:** Uses **MusicGenโ€™s decoder** for feature extraction, operating on the audio tokens produced by **EnCodec**.

2. **Text Encoder:** A pretrained RoBERTa model **fine-tuned on music style/genre text** with a masked language modeling (MLM) objective.

3. **Projection Head:** A multi-layer perceptron (MLP) that projects both text and audio embeddings into the same space.

4. **Contrastive(ish) Learning:** Trained with a **listwise ranking loss** instead of a traditional contrastive loss to align text and audio embeddings, improving retrieval performance for tasks like music similarity search (see the sketch below).
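
To make the last two pieces concrete, here is a minimal, illustrative sketch of a projection head and a ListNet-style listwise ranking loss. The layer sizes, temperature, target relevances, and encoder output dimensions below are assumptions for illustration only, not the exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Illustrative MLP projecting encoder features into the shared embedding space."""
    def __init__(self, in_dim: int, out_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products are cosine similarities
        return F.normalize(self.net(x), dim=-1)

def listwise_ranking_loss(scores: torch.Tensor, relevance: torch.Tensor) -> torch.Tensor:
    """ListNet-style listwise loss: cross-entropy between the softmax over
    predicted audio-text scores and the softmax over target relevances."""
    log_pred = F.log_softmax(scores, dim=-1)
    target = F.softmax(relevance, dim=-1)
    return -(target * log_pred).sum(dim=-1).mean()

# Toy batch with made-up encoder dimensions (assumptions, not the real model)
audio_features = torch.randn(4, 1536)   # stand-in for MusicGen decoder features
text_features = torch.randn(4, 768)     # stand-in for fine-tuned RoBERTa features

audio_emb = ProjectionHead(in_dim=1536)(audio_features)   # (4, 1024)
text_emb = ProjectionHead(in_dim=768)(text_features)      # (4, 1024)

scores = audio_emb @ text_emb.t() / 0.07   # (4, 4) similarity matrix
relevance = torch.eye(4) * 10.0            # each clip's own caption is most relevant
loss = listwise_ranking_loss(scores, relevance)
```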

## ๐Ÿ“€ Training Data
The model is trained on the [nyuuzyou/suno](https://huggingface.co/datasets/nyuuzyou/suno) dataset from Hugging Face. This dataset includes approximately **10K** curated audio-caption pairs, split into 80% training, 10% validation, and 10% evaluation. Captions are derived from the `metadata.tags` field, which describes musical styles and genres. Note that one could also include the full prompt from `metadata.prompt` alongside the style tags during training to obtain even richer audio/text embeddings supervised by full captions. A sketch of this preprocessing is shown below.
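
For reference, here is a minimal sketch of how the caption extraction and 80/10/10 split might be reproduced with the ๐Ÿค— `datasets` library; the split name, the nested `metadata.tags` column layout, and the seed are assumptions based on the description above and may not match the dataset's actual schema.

```python
from datasets import load_dataset

# Load the Suno dataset (split name and schema are assumptions here).
ds = load_dataset("nyuuzyou/suno", split="train")

# Derive captions from the style/genre tags, per the description above.
def add_caption(example):
    example["caption"] = example["metadata"]["tags"]
    return example

ds = ds.map(add_caption)

# 80% train / 10% validation / 10% evaluation.
splits = ds.train_test_split(test_size=0.2, seed=42)
holdout = splits["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, eval_ds = splits["train"], holdout["train"], holdout["test"]
```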

*Note: since our CLAP model is trained on AI-generated music-caption pairs from Suno, it forms a synthetic data loop in which one AI learns from another AIโ€™s outputs. This introduces the potential biases of training on AI-generated data and leaves room for further refinement by incorporating human-annotated music datasets.*

## ๐Ÿ’ป Quick Start

### Installation

To install the necessary dependencies, run:

```bash
pip install torch torchvision torchaudio transformers
```

### Loading the Model from ๐Ÿค— Hugging Face

First, clone the [project repository](https://github.com/yuhuacheng/clap_musicgen) and run the following from the project root:

```python
from src.modules.clap_model import CLAPModel
from transformers import RobertaTokenizer

model = CLAPModel.from_pretrained("yuhuacheng/clap-musicgen")
tokenizer = RobertaTokenizer.from_pretrained("yuhuacheng/clap-roberta-finetuned")
```

### Extracting Embeddings

#### From Audio

```python
import torch 

with torch.no_grad():
  waveform = torch.rand(1, 1, 32000) # 1 sec waveform at 32kHz sample rate
  audio_embeddings = model.audio_encoder(ids=None, waveform=waveform)
  print(audio_embeddings.shape) # (1, 1024)
```

#### From Text

```python
sample_captions = [
    'positive jazzy lofi',
    'fast house edm',
    'gangsta rap',
    'dark metal'
]

with torch.no_grad():
    tokenized_captions = tokenizer(sample_captions, return_tensors="pt", padding=True, truncation=True)
    text_embeddings = model.text_encoder(ids=None, **tokenized_captions)
    print(text_embeddings.shape) # (4, 1024)
```


## ๐ŸŽง Similarity Search Demo
Please refer to the [demo](demo.ipynb) notebook, which demonstrates both **audio-to-audio** and **text-to-audio** search.
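
The core of both searches is cosine similarity between embeddings produced as in the Quick Start. The sketch below illustrates the idea with random placeholders: `catalog_embeddings` and `query_embedding` stand in for precomputed embeddings and are not part of the released API.

```python
import torch
import torch.nn.functional as F

# Placeholders: an (N, 1024) catalog of precomputed audio embeddings and a
# (1, 1024) query embedding from either the audio or the text encoder.
catalog_embeddings = torch.randn(100, 1024)
query_embedding = torch.randn(1, 1024)

# Cosine similarity between the query and every track in the catalog.
similarities = F.cosine_similarity(query_embedding, catalog_embeddings, dim=-1)

# Retrieve the five most similar tracks.
top_scores, top_indices = similarities.topk(5)
print(top_indices.tolist(), top_scores.tolist())
```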

*(Result snapshots)*

๐ŸŽต Audio-to-Audio Search
![Audio to Audios](images/audio_to_audio.png)

๐Ÿ’ฌ Text-to-Audio Search
![Text to Audios](images/text_to_audio.png)


## ๐Ÿคฟ Training / Evaluation Deep Dives
(Coming soon)

## ๐Ÿชช License
* The code in this repository is released under the MIT license as found in the [LICENSE file](LICENSE).
* Since the model was trained on top of the pretrained MusicGen weights, the model weights in this repository are released under the CC-BY-NC 4.0 license, as found in the [LICENSE_weights file](LICENSE_weights).


## ๐Ÿ–‡๏ธ Citation
```
@inproceedings{copet2023simple,
    title={Simple and Controllable Music Generation},
    author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Dรฉfossez},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
}
```

```
@inproceedings{laionclap2023,
    title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
    author={Wu*, Yusong and Chen*, Ke and Zhang*, Tianyu and Hui*, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
    booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
    year={2023},
}
```

```
@inproceedings{htsatke2022,
    author={Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
    title={HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
    booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
    year={2022},
}
```