---
license: cc-by-4.0
---
# Model Card for Magenta RT
**Authors**: Google DeepMind
**Resources**:
- [Blog Post](https://g.co/magenta/rt)
- [Colab Demo](https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb)
- [Repository](https://github.com/magenta/magenta-realtime)
- [HuggingFace](https://huggingface.co/google/magenta-realtime)
## Terms of Use
Magenta RealTime is offered under a combination of licenses: the codebase is
licensed under
[Apache 2.0](https://github.com/magenta/magenta-realtime/blob/main/LICENSE), and
the model weights under
[Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode).
In addition, we specify the following usage terms:
Copyright 2025 Google LLC
Use these materials responsibly and do not generate content, including outputs,
that infringe or violate the rights of others, including rights in copyrighted
content.
Google claims no rights in outputs you generate using Magenta RealTime. You and
your users are solely responsible for outputs and their subsequent uses.
Unless required by applicable law or agreed to in writing, all software and
materials distributed here under the Apache 2.0 or CC-BY licenses are
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied. See the licenses for the specific language governing
permissions and limitations under those licenses. You are solely responsible for
determining the appropriateness of using, reproducing, modifying, performing,
displaying or distributing the software and materials, and any outputs, and
assume any and all risks associated with your use or distribution of any of the
software and materials, and any outputs, and your exercise of rights and
permissions under the licenses.
## Model Details
Magenta RealTime is an open music generation model from Google built from the
same research and technology used to create
[MusicFX DJ](https://labs.google/fx/tools/music-fx-dj) and
[Lyria RealTime](http://goo.gle/lyria-realtime). Magenta RealTime enables the
continuous generation of musical audio steered by a text prompt, an audio
example, or a weighted combination of multiple text prompts and/or audio
examples. Its relatively small size makes it possible to deploy in environments
with limited resources, including live performance settings or freely available
Colab TPUs.
### System Components
Magenta RealTime is composed of three components: SpectroStream, MusicCoCa, and
an LLM. A full technical report is forthcoming that will explain each component
in more detail.
1. **SpectroStream** is a discrete audio codec that converts stereo 48kHz audio
into tokens, building on the SoundStream RVQ codec from
[Zeghidour+ 21](https://arxiv.org/abs/2107.03312).
1. **MusicCoCa** is a contrastively trained model capable of embedding audio and
text into a common embedding space, building on
[Yu+ 22](https://arxiv.org/abs/2205.01917) and
[Huang+ 22](https://arxiv.org/abs/2208.12415).
1. An **encoder-decoder Transformer LLM** generates audio tokens given context
audio tokens and a tokenized MusicCoCa embedding, building on the MusicLM
method from [Agostinelli+ 23](https://arxiv.org/abs/2301.11325).
### Inputs and outputs
- **SpectroStream RVQ codec**: Tokenizes high-fidelity music audio
- **Encoder input / Decoder output**: Music audio waveforms, 48kHz stereo
- **Encoder output / Decoder input**: Discrete audio tokens, 25Hz frame
rate, 64 RVQ depth, 10 bit codes, 16kbps
- **MusicCoCa**: Joint embeddings of text and music audio
- **Input**: Music audio waveforms, 16kHz mono, or text representation of
music style e.g. "heavy metal"
- **Output**: 768 dimensional embedding, quantized to 12 RVQ depth, 10 bit
codes
- **Encoder-decoder Transformer LLM**: Generates audio tokens given context
and style
- **Encoder Input**: (Context, 1000 tokens) 10s of audio context tokens w/
4 RVQ depth, (Style, 6 tokens) Quantized MusicCoCa style embedding
- **Decoder Output**: (Generated, 800 tokens) 2s of audio w/ 16 RVQ depth
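The token counts above follow directly from the 25Hz frame rate and the RVQ
depths used at each interface. Below is a minimal sketch of that arithmetic in
plain Python, included only to make the numbers concrete:
```python
# Arithmetic behind the interface shapes listed above (illustration only).
FRAME_RATE_HZ = 25   # SpectroStream frames per second
BITS_PER_CODE = 10   # each RVQ level emits a 10-bit code

# Codec bitrate: 25 frames/s * 64 RVQ levels * 10 bits = 16,000 bits/s (16kbps).
assert FRAME_RATE_HZ * 64 * BITS_PER_CODE == 16_000

# LLM encoder context: 10s of audio at 4 RVQ levels = 25 * 10 * 4 = 1000 tokens,
# plus 6 tokens for the quantized MusicCoCa style embedding.
assert FRAME_RATE_HZ * 10 * 4 == 1000

# LLM decoder output: 2s of audio at 16 RVQ levels = 25 * 2 * 16 = 800 tokens.
assert FRAME_RATE_HZ * 2 * 16 == 800
```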
## Uses
Music generation models, in particular ones targeted for continuous real-time
generation and control, have a wide range of applications across various
industries and domains. The following list of potential uses is not
comprehensive. The purpose of this list is to provide contextual information
about the possible use-cases that the model creators considered as part of model
training and development.
- **Interactive Music Creation**
- Live Performance / Improvisation: These models can be used to generate
music in a live performance setting, controlled by performers
manipulating style embeddings or the audio context
- Accessible Music-Making & Music Therapy: People with impediments to
using traditional instruments (skill gaps, disabilities, etc.) can
participate in communal jam sessions or solo music creation.
- Video Games: Developers can create a custom soundtrack for users in
real-time based on their actions and environment.
- **Research**
- Transfer learning: Researchers can leverage representations from
MusicCoCa and Magenta RT for music recognition and understanding tasks.
- **Personalization**
- Musicians can finetune models with their own catalog to customize the
model to their style (fine tuning support coming soon).
- **Education**
- Exploring Genres, Instruments, and History: Natural language prompting
enables users to quickly learn about and experiment with musical
concepts.
### Out-of-Scope Use
See our [Terms of Use](#terms-of-use) above for usage we consider out of scope.
## Bias, Risks, and Limitations
Magenta RT supports the real-time generation and steering of instrumental music.
The purpose and intention of this capability is to foster the development of new
real-time, interactive co-creation workflows that seamlessly integrate with
human-centered forms of musical creativity.
Every AI music generation model, including Magenta RT, carries a risk of
impacting the economic and cultural landscape of music. We aim to mitigate these
risks through the following avenues:
- Prioritizing human-AI interaction as fundamental in the design of Magenta
RT.
- Distributing the model under terms of service that prohibit developers
from generating outputs that infringe or violate the rights of others,
including rights in copyrighted content.
- Training on primarily instrumental data. With specific prompting, this model
has been observed to generate some vocal sounds and effects, though these
tend to be non-lexical.
### Known limitations
**Coverage of broad musical styles**. Magenta RT's training data primarily
consists of Western instrumental music. As a consequence, Magenta RT has
incomplete coverage of both vocal performance and the broader landscape of rich
musical traditions worldwide. For real-time generation with broader style
coverage, we refer users to our
[Lyria RealTime API](https://g.co/magenta/lyria-realtime).
**Vocals**. While the model is capable of generating non-lexical vocalizations
and humming, it is not conditioned on lyrics and is unlikely to generate actual
words. However, there remains some risk of generating explicit or
culturally-insensitive lyrical content.
**Latency**. Because the Magenta RT LLM operates on two-second chunks, user
inputs for the style prompt may take two or more seconds to influence the
musical output.
**Limited context**. Because the Magenta RT encoder has a maximum audio context
window of ten seconds, the model is unable to directly reference music that has
been output earlier than that. While the context is sufficient to enable the
model to create melodies, rhythms, and chord progressions, the model is not
capable of automatically creating longer-term song structures.
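Concretely, with two-second chunks and a ten-second context window, the model
only ever attends to its five most recent chunks. The sketch below
(illustrative Python, not the actual implementation; `generate_chunk` is a
hypothetical stand-in for the LLM call) shows the kind of rolling context
buffer this implies:
```python
from collections import deque

CHUNK_SECONDS = 2     # each decoder call emits 2s of audio
CONTEXT_SECONDS = 10  # maximum audio context seen by the encoder
MAX_CONTEXT_CHUNKS = CONTEXT_SECONDS // CHUNK_SECONDS  # = 5 chunks

def generate_chunk(context_chunks, style):
    """Hypothetical stand-in for the Magenta RT LLM call."""
    return f"audio conditioned on {len(context_chunks)} prior chunks, style={style!r}"

# Rolling buffer of recent chunks: anything older than 10s falls out of the
# model's view, so longer-term song structure must be managed outside the model.
context = deque(maxlen=MAX_CONTEXT_CHUNKS)
for _ in range(8):
    chunk = generate_chunk(list(context), "heavy metal")
    context.append(chunk)  # oldest chunk is dropped once 5 are buffered
```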
### Benefits
At the time of release, Magenta RealTime represents the only open weights model
supporting real-time, continuous musical audio generation. It is designed
specifically to enable live, interactive musical creation, bringing new
capabilities to musical performances, art installations, video games, and many
other applications.
## How to Get Started with the Model
See our
[Colab demo](https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb)
and [GitHub repository](https://github.com/magenta/magenta-realtime) for usage
examples.
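As a rough orientation before opening the Colab, the real-time loop is
conceptually a repeated call that turns a style prompt and the recent audio
context into the next two-second chunk. The sketch below is an
assumption-laden outline, not the supported API; names such as `MagentaRT`,
`embed_style`, and `generate_chunk` are illustrative, so defer to the Colab
demo and repository for working code:
```python
# Conceptual outline of continuous generation; the names below are illustrative
# assumptions, not a documented API (see the Colab demo / repository).
from magenta_rt import system  # assumed import path

mrt = system.MagentaRT()                         # load SpectroStream, MusicCoCa, and the LLM
style = mrt.embed_style("warm synth arpeggios")  # text prompt -> quantized MusicCoCa embedding

state, chunks = None, []
for _ in range(5):                               # 5 chunks of ~2s each -> ~10s of audio
    chunk, state = mrt.generate_chunk(state=state, style=style)
    chunks.append(chunk)                         # in a live setting, stream each chunk to the audio device
```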
## Training Details
### Training Data
Magenta RealTime was trained on ~190k hours of stock music from multiple
sources, mostly instrumental.
### Hardware
Magenta RealTime was trained using
[Tensor Processing Unit (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu)
hardware (TPUv6e / Trillium).
### Software
Training was done using [JAX](https://github.com/jax-ml/jax) and
[T5X](https://github.com/google-research/t5x), utilizing
[SeqIO](https://github.com/google/seqio) for data pipelines. JAX allows
researchers to take advantage of the latest generation of hardware, including
TPUs, for faster and more efficient training of large models.
## Evaluation
Model evaluation metrics and results will be shared in our forthcoming technical
report.
## Citation
A technical report is forthcoming. For now, please cite our
[blog post](https://g.co/magenta/rt).
**BibTeX:**
```
@article{magenta_rt,
  title={Magenta RealTime},
  url={https://g.co/magenta/rt},
  publisher={Google DeepMind},
  author={Lyria Team},
  year={2025}
}
```