---
tags:
- music
- musicgen
- clap
license: cc-by-nc-4.0
pipeline_tag: text-to-audio
---

(Below is from https://github.com/yuhuacheng/clap_musicgen)

# 👏🏻 CLAP-MusicGen 🎵

CLAP-MusicGen is a contrastive audio-text embedding model that combines the strengths of Contrastive Language-Audio Pretraining (CLAP) with Meta's [MusicGen](https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md) as the audio encoder. Users can generate latent embeddings for any given audio or text, enabling downstream tasks like music similarity search and audio classification.

**Note** that this is a **proof-of-concept** personal project; it is not aimed at providing the highest-quality embeddings but rather at demonstrating the idea.

## Table of Contents

- [👨‍🏫 Overview](#-overview)
- [🏗️ Model Architecture](#-model-architecture)
- [📀 Training Data](#-training-data)
- [💻 Quick Start](#-quick-start)
- [🎧 Similarity Search Demo](#-similarity-search-demo)
- [🤿 Training / Evaluation Deep Dives](#-training--evaluation-deep-dives)
- [🪪 License](#-license)
- [🖇️ Citation](#-citation)

## 👨‍🏫 Overview

CLAP-MusicGen is a multimodal model designed to enhance music retrieval capabilities. By embedding both audio and text into a shared space, it enables efficient music-to-music and text-to-music search. Unlike traditional models limited to predefined categories, CLAP-MusicGen supports zero-shot classification, retrieval, and embedding extraction, making it a valuable tool for exploring and organizing music collections.

### Key Capabilities:

- **MusicGen-based Audio Encoding:** Uses **MusicGen** to extract high-quality audio embeddings.
- **Two-way Retrieval:** Supports searching for audio given an input audio clip or a text query.

## 🏗️ Model Architecture

CLAP-MusicGen consists of:

1. **Audio Encoder:** Uses **MusicGen's decoder** for feature extraction, given tokenized inputs from **EnCodec**.
2. **Text Encoder:** A pretrained RoBERTa **finetuned on music style/genre text** with an MLM objective.
3. **Projection Head:** A multi-layer perceptron (MLP) that projects both text and audio embeddings into the same space.
4. **Contrastive(ish) Learning:** Trained with a **listwise ranking loss** instead of a traditional contrastive loss to optimize the alignment between text and audio embeddings, enhancing retrieval performance for tasks like music similarity search.

## 📀 Training Data

The model is trained on the [nyuuzyou/suno](https://huggingface.co/datasets/nyuuzyou/suno) dataset from Hugging Face. This dataset includes approximately **10K** curated audio-caption pairs, split into 80% training, 10% validation, and 10% evaluation. Captions are derived from the `metadata.tags` field, which provides descriptions of musical styles and genres. Note that one could also include the full prompt from `metadata.prompt` alongside the style tags during training, to obtain even richer audio/text embeddings supervised by the full captions, as sketched below.
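For illustration, here is a minimal, hypothetical sketch of how such captions could be assembled. The record schema (a nested `metadata` dict with `tags` and `prompt` keys) and the split name are assumptions inferred from the field names above, not verified against the dataset; adjust them to the actual layout.

```python
from datasets import load_dataset

# Load the Suno dataset from Hugging Face (split name assumed).
ds = load_dataset("nyuuzyou/suno", split="train")

def build_caption(example, include_prompt=False):
    """Derive a caption from the style tags, optionally appending the full prompt.

    Assumes each record exposes a `metadata` dict with `tags` and `prompt` keys.
    """
    caption = example["metadata"]["tags"]
    if include_prompt and example["metadata"].get("prompt"):
        caption = f'{caption}. {example["metadata"]["prompt"]}'
    return {"caption": caption}

ds = ds.map(build_caption)
print(ds[0]["caption"])
```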
*Note: our CLAP model is trained on AI-generated music-caption pairs from Suno, forming a synthetic data loop in which one AI learns from another AI's outputs. This introduces the potential biases of training on AI-generated data, and it leaves room for further refinement by incorporating human-annotated music datasets.*

## 💻 Quick Start

### Installation

To install the necessary dependencies, run:

```bash
pip install torch torchvision torchaudio transformers
```

### Loading the Model from 🤗 Hugging Face

First, clone the project repository and navigate to the project directory. Then load the model and tokenizer:

```python
from src.modules.clap_model import CLAPModel
from transformers import RobertaTokenizer

model = CLAPModel.from_pretrained("yuhuacheng/clap-musicgen")
tokenizer = RobertaTokenizer.from_pretrained("yuhuacheng/clap-roberta-finetuned")
```

### Extracting Embeddings

#### From Audio

```python
import torch

with torch.no_grad():
    waveform = torch.rand(1, 1, 32000)  # 1 sec of audio at a 32kHz sample rate
    audio_embeddings = model.audio_encoder(ids=None, waveform=waveform)

print(audio_embeddings.shape)  # (1, 1024)
```

#### From Text

```python
sample_captions = [
    'positive jazzy lofi',
    'fast house edm',
    'gangsta rap',
    'dark metal'
]

with torch.no_grad():
    tokenized_captions = tokenizer(list(sample_captions), return_tensors="pt", padding=True, truncation=True)
    text_embeddings = model.text_encoder(ids=None, **tokenized_captions)

print(text_embeddings.shape)  # (4, 1024)
```

## 🎧 Similarity Search Demo

Please refer to the [demo](demo.ipynb) notebook, which demonstrates both **audio-to-audio** and **text-to-audio** search. A minimal retrieval sketch using these embeddings appears after the citations at the end of this README.

*(Result snapshots)*

🎵 Audio-to-Audio Search

![Audio to Audios](images/audio_to_audio.png)

💬 Text-to-Audio Search

![Text to Audios](images/text_to_audio.png)

## 🤿 Training / Evaluation Deep Dives

(Coming soon)

## 🪪 License

* The code in this repository is released under the MIT license, as found in the [LICENSE file](LICENSE).
* Since the model was trained from the pretrained MusicGen weights, the model weights in this repository are released under the CC-BY-NC 4.0 license, as found in the [LICENSE_weights file](LICENSE_weights).

## 🖇️ Citation

```
@inproceedings{copet2023simple,
    title={Simple and Controllable Music Generation},
    author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023}
}
```

```
@inproceedings{laionclap2023,
    title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
    author={Wu*, Yusong and Chen*, Ke and Zhang*, Tianyu and Hui*, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
    booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
    year={2023}
}
```

```
@inproceedings{htsatke2022,
    author={Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
    title={HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
    booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
    year={2022}
}
```
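As a complement to the demo notebook, below is a minimal, hypothetical retrieval sketch: it L2-normalizes a bank of audio embeddings and a query embedding, then ranks tracks by cosine similarity. The random tensors are placeholders standing in for the real outputs of `model.audio_encoder` / `model.text_encoder` from the Quick Start section; the actual demo notebook may implement retrieval differently.

```python
import torch
import torch.nn.functional as F

# Placeholders standing in for real CLAP-MusicGen outputs; in practice these
# would come from model.audio_encoder / model.text_encoder (see Quick Start).
audio_bank = torch.rand(100, 1024)  # embeddings for 100 candidate tracks
query = torch.rand(1, 1024)         # embedding for one text (or audio) query

# Cosine similarity is the dot product of L2-normalized vectors.
audio_bank = F.normalize(audio_bank, dim=-1)
query = F.normalize(query, dim=-1)
scores = query @ audio_bank.T       # shape (1, 100): one score per track

# Retrieve the indices and scores of the top-5 most similar tracks.
top_scores, top_idx = scores.topk(k=5, dim=-1)
print(top_idx.tolist(), top_scores.tolist())
```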