---
license: cc-by-4.0
---

# Model Card for Magenta RT

**Authors**: Google DeepMind

**Resources**:

- [Blog Post](https://g.co/magenta/rt)
- [Colab Demo](https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb)
- [Repository](https://github.com/magenta/magenta-realtime)
- [HuggingFace](https://huggingface.co/google/magenta-realtime)

## Terms of Use

Magenta RealTime is offered under a combination of licenses: the codebase is
licensed under
[Apache 2.0](https://github.com/magenta/magenta-realtime/blob/main/LICENSE),
and the model weights under
[Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode).
In addition, we specify the following usage terms:

Copyright 2025 Google LLC

Use these materials responsibly and do not generate content, including
outputs, that infringes or violates the rights of others, including rights in
copyrighted content.

Google claims no rights in outputs you generate using Magenta RealTime. You
and your users are solely responsible for outputs and their subsequent uses.

Unless required by applicable law or agreed to in writing, all software and
materials distributed here under the Apache 2.0 or CC-BY licenses are
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied. See the licenses for the specific language
governing permissions and limitations under those licenses. You are solely
responsible for determining the appropriateness of using, reproducing,
modifying, performing, displaying, or distributing the software and materials,
and any outputs, and assume any and all risks associated with your use or
distribution of any of the software and materials, and any outputs, and your
exercise of rights and permissions under the licenses.

## Model Details

Magenta RealTime is an open music generation model from Google built from the
same research and technology used to create
[MusicFX DJ](https://labs.google/fx/tools/music-fx-dj) and
[Lyria RealTime](http://goo.gle/lyria-realtime). Magenta RealTime enables the
continuous generation of musical audio steered by a text prompt, an audio
example, or a weighted combination of multiple text prompts and/or audio
examples. Its relatively small size makes it possible to deploy in
resource-constrained environments, including live performance settings and
freely available Colab TPUs.

### System Components

Magenta RealTime is composed of three components: SpectroStream, MusicCoCa,
and an LLM. A forthcoming technical report will explain each component in more
detail.

1. **SpectroStream** is a discrete audio codec that converts stereo 48kHz
   audio into tokens, building on the SoundStream RVQ codec from
   [Zeghidour+ 21](https://arxiv.org/abs/2107.03312).
1. **MusicCoCa** is a contrastively trained model that embeds audio and text
   into a common embedding space, building on
   [Yu+ 22](https://arxiv.org/abs/2205.01917) and
   [Huang+ 22](https://arxiv.org/abs/2208.12415).
1. An **encoder-decoder Transformer LLM** generates audio tokens given context
   audio tokens and a tokenized MusicCoCa embedding, building on the MusicLM
   method from [Agostinelli+ 23](https://arxiv.org/abs/2301.11325).

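At inference time, these components form a simple loop: MusicCoCa turns the
user's prompt into a style embedding, the LLM generates the next chunk of
audio tokens given that embedding plus recent context tokens, and
SpectroStream decodes the tokens back to a waveform. Below is a minimal,
runnable sketch of that dataflow with trivial stand-ins for the three models;
only the sample rate, token frame rate, chunk length, and context length come
from this card, and every function is a placeholder rather than the real
`magenta_rt` API.

```python
# Runnable sketch of the Magenta RT dataflow, using placeholder models.
import numpy as np

SAMPLE_RATE = 48_000   # SpectroStream works on 48kHz stereo audio
FRAME_RATE = 25        # audio token frame rate (Hz)
CHUNK_SECONDS = 2      # the LLM decodes 2s of audio per step
CONTEXT_SECONDS = 10   # the LLM encoder sees at most 10s of context

def embed_style(prompt: str) -> np.ndarray:
    """Stand-in for MusicCoCa: text prompt -> 768-dim style embedding."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal(768)

def generate_tokens(context: np.ndarray, style: np.ndarray) -> np.ndarray:
    """Stand-in for the LLM: context + style -> 2s of tokens, 16 RVQ depth."""
    return np.zeros((FRAME_RATE * CHUNK_SECONDS, 16), dtype=np.int32)

def decode(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the SpectroStream decoder: tokens -> stereo waveform."""
    return np.zeros((tokens.shape[0] * SAMPLE_RATE // FRAME_RATE, 2))

style = embed_style("heavy metal")
context = np.zeros((0, 16), dtype=np.int32)
for _ in range(5):  # generate 10s of audio, one 2s chunk at a time
    tokens = generate_tokens(context, style)
    chunk = decode(tokens)  # in a real app, stream this to the audio device
    # Keep only the most recent 10s of tokens as context for the next chunk.
    context = np.concatenate([context, tokens])[-FRAME_RATE * CONTEXT_SECONDS:]
```
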
### Inputs and outputs

- **SpectroStream RVQ codec**: Tokenizes high-fidelity music audio
    - **Encoder input / Decoder output**: Music audio waveforms, 48kHz stereo
    - **Encoder output / Decoder input**: Discrete audio tokens, 25Hz frame
      rate, 64 RVQ depth, 10-bit codes, 16kbps
- **MusicCoCa**: Jointly embeds text and music audio
    - **Input**: Music audio waveforms, 16kHz mono, or a text description of
      musical style, e.g. "heavy metal"
    - **Output**: 768-dimensional embedding, quantized to 12 RVQ depth,
      10-bit codes
- **Encoder-decoder Transformer LLM**: Generates audio tokens given context
  and style
    - **Encoder input**: (Context, 1000 tokens) 10s of audio context tokens
      w/ 4 RVQ depth; (Style, 6 tokens) quantized MusicCoCa style embedding
    - **Decoder output**: (Generated, 800 tokens) 2s of audio w/ 16 RVQ depth

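The token counts and bitrate above follow directly from the frame rate and RVQ
parameters; the quick arithmetic check below reproduces them.

```python
# Quick arithmetic check of the token counts and bitrate quoted above.
frame_rate = 25      # token frames per second of audio
bits_per_code = 10   # each RVQ codebook has 2**10 = 1024 entries

# SpectroStream at full depth: 25 Hz * 64 RVQ levels * 10 bits = 16 kbps.
print(frame_rate * 64 * bits_per_code)  # 16000 bits/s = 16 kbps

# LLM encoder context: 10 s * 25 Hz * 4 RVQ levels = 1000 tokens.
print(10 * frame_rate * 4)              # 1000

# LLM decoder output: 2 s * 25 Hz * 16 RVQ levels = 800 tokens.
print(2 * frame_rate * 16)              # 800
```
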
## Uses

Music generation models, in particular ones targeted at continuous real-time
generation and control, have a wide range of applications across various
industries and domains. The following list of potential uses is not
comprehensive; its purpose is to provide contextual information about the
possible use cases that the model creators considered as part of model
training and development.

- **Interactive Music Creation**
    - Live Performance / Improvisation: These models can be used to generate
      music in a live performance setting, controlled by performers
      manipulating style embeddings or the audio context.
    - Accessible Music-Making & Music Therapy: People with impediments to
      using traditional instruments (skill gaps, disabilities, etc.) can
      participate in communal jam sessions or solo music creation.
    - Video Games: Developers can create custom soundtracks for players in
      real time based on their actions and environment.
- **Research**
    - Transfer learning: Researchers can leverage representations from
      MusicCoCa and Magenta RT to recognize musical information.
- **Personalization**
    - Musicians can fine-tune the model on their own catalog to customize it
      to their style (fine-tuning support coming soon).
- **Education**
    - Exploring Genres, Instruments, and History: Natural language prompting
      enables users to quickly learn about and experiment with musical
      concepts.

### Out-of-Scope Use

See our [Terms of Use](#terms-of-use) above for usage we consider out of
scope.

## Bias, Risks, and Limitations

Magenta RT supports the real-time generation and steering of instrumental
music. The purpose of this capability is to foster the development of new
real-time, interactive co-creation workflows that seamlessly integrate with
human-centered forms of musical creativity.

Every AI music generation model, including Magenta RT, carries a risk of
impacting the economic and cultural landscape of music. We aim to mitigate
these risks through the following avenues:

- Prioritizing human-AI interaction as fundamental in the design of Magenta
  RT.
- Distributing the model under usage terms that prohibit developers from
  generating outputs that infringe or violate the rights of others, including
  rights in copyrighted content.
- Training primarily on instrumental data. With specific prompting, this
  model has been observed to generate some vocal sounds and effects, though
  these tend to be non-lexical.

### Known limitations

**Coverage of broad musical styles.** Magenta RT's training data primarily
consists of Western instrumental music. As a consequence, Magenta RT has
incomplete coverage of both vocal performance and the broader landscape of
rich musical traditions worldwide. For real-time generation with broader style
coverage, we refer users to our
[Lyria RealTime API](https://g.co/magenta/lyria-realtime).

**Vocals.** While the model is capable of generating non-lexical vocalizations
and humming, it is not conditioned on lyrics and is unlikely to generate
actual words. However, there remains some risk of generating explicit or
culturally insensitive lyrical content.

**Latency.** Because the Magenta RT LLM operates on two-second chunks, user
inputs such as style prompt changes may take two or more seconds to influence
the musical output.

**Limited context.** Because the Magenta RT encoder has a maximum audio
context window of ten seconds, the model cannot directly reference music it
output earlier than that. While this context is sufficient for the model to
create melodies, rhythms, and chord progressions, it cannot automatically
create longer-term song structure.

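To make the context limitation concrete, the small helper below (purely
illustrative, not part of the model's API) computes the span of past audio the
encoder can still attend to at a given point in a stream.

```python
# For audio generated at time t (seconds), the encoder only sees the window
# [t - 10, t): anything older has scrolled out of the context.
def visible_window(t: float, context_seconds: float = 10.0) -> tuple[float, float]:
    """Return the (start, end) times of past audio the encoder can see."""
    return (max(0.0, t - context_seconds), t)

print(visible_window(4.0))   # (0.0, 4.0): early on, the whole stream is visible
print(visible_window(60.0))  # (50.0, 60.0): a motif from t=30s is out of reach
```
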
### Benefits

At the time of release, Magenta RealTime is the only open-weights model
supporting real-time, continuous musical audio generation. It is designed
specifically to enable live, interactive musical creation, bringing new
capabilities to musical performances, art installations, video games, and many
other applications.

## How to Get Started with the Model

See our
[Colab demo](https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb)
and [GitHub repository](https://github.com/magenta/magenta-realtime) for usage
examples.

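For orientation, here is a condensed sketch of the streaming loop. It follows
the shape of the example in the repository README, but the module and function
names (`system.MagentaRT`, `system.embed_style`, `generate_chunk`,
`audio.concatenate`) are reproduced on a best-effort basis and may lag the
current code; treat the repository as authoritative.

```python
# Sketch of the streaming API; names below are assumptions drawn from the
# repository README and may differ from the current magenta_rt release.
from magenta_rt import audio, system

mrt = system.MagentaRT()             # load the model weights
style = system.embed_style('funk')   # MusicCoCa: text -> style embedding

state, chunks = None, []
for _ in range(10):                  # ~20s of audio, one 2s chunk per step
    state, chunk = mrt.generate_chunk(state=state, style=style)
    chunks.append(chunk)

# Crossfade adjacent chunks into one continuous waveform.
generated = audio.concatenate(chunks, crossfade_time=mrt.crossfade_length)
```
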
## Training Details

### Training Data

Magenta RealTime was trained on ~190k hours of stock music from multiple
sources, mostly instrumental.

### Hardware

Magenta RealTime was trained using
[Tensor Processing Unit (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu)
hardware (TPUv6e / Trillium).

### Software

Training was done using [JAX](https://github.com/jax-ml/jax) and
[T5X](https://github.com/google-research/t5x), utilizing
[SeqIO](https://github.com/google/seqio) for data pipelines. JAX allows
researchers to take advantage of the latest generation of hardware, including
TPUs, for faster and more efficient training of large models.

## Evaluation |
|
|
|
Model evaluation metrics and results will be shared in our forthcoming technical |
|
report. |
|
|
|
## Citation |
|
|
|
A technical report is forthcoming. For now, please cite our |
|
[blog post](https://g.co/magenta/rt). |
|
|
|
**BibTeX:** |
|
|
|
``` |
|
@article{magenta_rt, |
|
title={Magenta RealTime}, |
|
url={https://g.co/magenta/rt}, |
|
publisher={Google DeepMind}, |
|
author={Lyria Team}, |
|
year={2025} |
|
} |
|
``` |
|
|