japanese-mulan-base

This is a Japanese MuLan (Music-Language pretraining) model developed by LY Corporation. The model was trained on ~20k internal music-text pairs and is applicable to various music tasks, including zero-shot music classification and text-to-music or music-to-text retrieval.

How to use

  1. Install packages
pip install transformers[torch] torchaudio sentence-transformers sentencepiece
  2. Run
import torch
import torch.nn.functional as F
import torchaudio
from transformers import AutoModel, AutoProcessor

HF_MODEL_PATH = "line-corporation/japanese-mulan-base"

model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)

url = "https://cdn.bensound.com/bensound-happyrock.mp3"  # music by Bensound.com
waveform, sample_rate = torchaudio.load(url)
# stereo to mono + unbatched to batched
waveform = waveform.mean(dim=0, keepdim=True)

labels = ["ロック", "ヒップホップ", "ジャズ", "クラシック"]

processor.eval()
model.eval()

with torch.no_grad():
    music_feature = processor.get_music_feature(waveform, sample_rate=sample_rate)
    text_feature = processor.get_text_feature(labels)
    music_embedding = model.get_music_features(**music_feature)
    text_embedding = model.get_text_features(**text_feature)

# batched to unbatched
music_embedding = music_embedding.squeeze(dim=0)

# NOTE: music_embedding is not normalized by L2 norm.
similarity = F.cosine_similarity(music_embedding, text_embedding, dim=-1)
label_index = torch.argmax(similarity, dim=-1)
label = labels[label_index.item()]

print("Estimated label:", label)
# Estimated label: ロック
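
The same embeddings can also be used for retrieval by ranking candidates against a query. The following is a minimal, dependency-free sketch of that idea, not part of the model's API: the helper names `l2_normalize` and `rank_by_cosine` and the toy 3-dimensional vectors are assumptions made for illustration. Because the embeddings are not L2-normalized, the sketch normalizes both sides first so that the dot product equals the cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def rank_by_cosine(query, candidates, top_k=3):
    """Return (index, score) pairs for the top_k candidates most similar to query."""
    q = l2_normalize(query)
    scores = []
    for i, cand in enumerate(candidates):
        c = l2_normalize(cand)
        # The dot product of two unit vectors equals their cosine similarity.
        scores.append((i, sum(a * b for a, b in zip(q, c))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy stand-ins for embeddings (hypothetical values, not model output).
text_query = [0.9, 0.1, 0.0]
music_library = [
    [0.0, 1.0, 0.0],  # track 0
    [2.0, 0.2, 0.1],  # track 1: close to the query direction, larger norm
    [0.5, 0.5, 0.5],  # track 2
]

ranking = rank_by_cosine(text_query, music_library, top_k=2)
print("Top tracks:", [index for index, _ in ranking])
```

With real model output, `text_embedding.tolist()` and `music_embedding.tolist()` would convert the tensors into lists this sketch accepts. For a large library, the same ranking is better computed with tensor operations such as `F.normalize` followed by a matrix product.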

Model architecture

The model uses an Audio Spectrogram Transformer (AST) as the music encoder and GLuCoSE as the text encoder. The music encoder was initialized from the official AST checkpoint pretrained on AudioSet. The text encoder was initialized from pkshatech/GLuCoSE-base-ja.

Licenses

The Apache License, Version 2.0

Citation

@misc{clip-japanese-base,
    title = {Japanese MuLan Base},
    author = {Takuya Hasumi and Yusuke Fujita},
    url = {https://huggingface.co/line-corporation/japanese-mulan-base},
}