---
license: cc-by-4.0
---
# Model Card for Magenta RT
**Authors**: Google DeepMind
**Resources**:
- [Blog Post](https://g.co/magenta/rt)
- [Colab Demo](https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb)
- [Repository](https://github.com/magenta/magenta-realtime)
- [HuggingFace](https://huggingface.co/google/magenta-realtime)
## Terms of Use
Magenta RealTime is offered under a combination of licenses: the codebase is
licensed under
[Apache 2.0](https://github.com/magenta/magenta-realtime/blob/main/LICENSE), and
the model weights under
[Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode).
In addition, we specify the following usage terms:
Copyright 2025 Google LLC
Use these materials responsibly and do not generate content, including outputs,
that infringe or violate the rights of others, including rights in copyrighted
content.
Google claims no rights in outputs you generate using Magenta RealTime. You and
your users are solely responsible for outputs and their subsequent uses.
Unless required by applicable law or agreed to in writing, all software and
materials distributed here under the Apache 2.0 or CC-BY licenses are
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied. See the licenses for the specific language governing
permissions and limitations under those licenses. You are solely responsible for
determining the appropriateness of using, reproducing, modifying, performing,
displaying or distributing the software and materials, and any outputs, and
assume any and all risks associated with your use or distribution of any of the
software and materials, and any outputs, and your exercise of rights and
permissions under the licenses.
## Model Details
Magenta RealTime is an open music generation model from Google built from the
same research and technology used to create
[MusicFX DJ](https://labs.google/fx/tools/music-fx-dj) and
[Lyria RealTime](http://goo.gle/lyria-realtime). Magenta RealTime enables the
continuous generation of musical audio steered by a text prompt, an audio
example, or a weighted combination of multiple text prompts and/or audio
examples. Its relatively small size makes it possible to deploy in environments
with limited resources, including live performance settings or freely available
Colab TPUs.
### System Components
Magenta RealTime is composed of three components: SpectroStream, MusicCoCa, and
an LLM. A full technical report is forthcoming that will explain each component
in more detail.
1. **SpectroStream** is a discrete audio codec that converts stereo 48kHz audio
into tokens, building on the SoundStream RVQ codec from
[Zeghidour+ 21](https://arxiv.org/abs/2107.03312).
1. **MusicCoCa** is a contrastively trained model capable of embedding audio and
text into a common embedding space, building on
[Yu+ 22](https://arxiv.org/abs/2205.01917) and
[Huang+ 22](https://arxiv.org/abs/2208.12415).
1. An **encoder-decoder Transformer LLM** generates audio tokens given context
audio tokens and a tokenized MusicCoCa embedding, building on the MusicLM
method from [Agostinelli+ 23](https://arxiv.org/abs/2301.11325).
### Inputs and outputs
- **SpectroStream RVQ codec**: Tokenizes high-fidelity music audio
- **Encoder input / Decoder output**: Music audio waveforms, 48kHz stereo
- **Encoder output / Decoder input**: Discrete audio tokens, 25Hz frame
rate, 64 RVQ depth, 10 bit codes, 16kbps
- **MusicCoCa**: Joint embeddings of text and music audio
- **Input**: Music audio waveforms, 16kHz mono, or text representation of
music style e.g. "heavy metal"
- **Output**: 768 dimensional embedding, quantized to 12 RVQ depth, 10 bit
codes
- **Encoder-decoder Transformer LLM**: Generates audio tokens given context
and style
- **Encoder Input**: (Context, 1000 tokens) 10s of audio context tokens w/
4 RVQ depth, (Style, 6 tokens) Quantized MusicCoCa style embedding
- **Decoder Output**: (Generated, 800 tokens) 2s of audio w/ 16 RVQ depth
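The token counts above follow directly from the 25Hz frame rate and the RVQ
depths used at each interface. Below is a minimal sketch of that arithmetic in
plain Python, included only to make the numbers concrete:
```python
# Arithmetic behind the interface shapes listed above (illustration only).
FRAME_RATE_HZ = 25   # SpectroStream frames per second
BITS_PER_CODE = 10   # each RVQ level emits a 10-bit code

# Codec bitrate: 25 frames/s * 64 RVQ levels * 10 bits = 16,000 bits/s (16kbps).
assert FRAME_RATE_HZ * 64 * BITS_PER_CODE == 16_000

# LLM encoder context: 10s of audio at 4 RVQ levels = 25 * 10 * 4 = 1000 tokens,
# plus 6 tokens for the quantized MusicCoCa style embedding.
assert FRAME_RATE_HZ * 10 * 4 == 1000

# LLM decoder output: 2s of audio at 16 RVQ levels = 25 * 2 * 16 = 800 tokens.
assert FRAME_RATE_HZ * 2 * 16 == 800
```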
## Uses
Music generation models, in particular ones targeted for continuous real-time
generation and control, have a wide range of applications across various
industries and domains. The following list of potential uses is not
comprehensive. The purpose of this list is to provide contextual information
about the possible use-cases that the model creators considered as part of model
training and development.
- **Interactive Music Creation**
- Live Performance / Improvisation: These models can be used to generate
music in a live performance setting, controlled by performers
manipulating style embeddings or the audio context
- Accessible Music-Making & Music Therapy: People with impediments to
using traditional instruments (skill gaps, disabilities, etc.) can
participate in communal jam sessions or solo music creation.
- Video Games: Developers can create a custom soundtrack for users in
real-time based on their actions and environment.
- **Research**
- Transfer learning: Researchers can leverage representations from
MusicCoCa and Magenta RT for music recognition and understanding tasks.
- **Personalization**
- Musicians can finetune models with their own catalog to customize the
model to their style (fine tuning support coming soon).
- **Education**
- Exploring Genres, Instruments, and History: Natural language prompting
enables users to quickly learn about and experiment with musical
concepts.
### Out-of-Scope Use
See our [Terms of Use](#terms-of-use) above for usage we consider out of scope.
## Bias, Risks, and Limitations
Magenta RT supports the real-time generation and steering of instrumental music.
The purpose and intention of this capability is to foster the development of new
real-time, interactive co-creation workflows that seamlessly integrate with
human-centered forms of musical creativity.
Every AI music generation model, including Magenta RT, carries a risk of
impacting the economic and cultural landscape of music. We aim to mitigate these
risks through the following avenues:
- Prioritizing human-AI interaction as fundamental in the design of Magenta
RT.
- Distributing the model under terms of service that prohibit developers
from generating outputs that infringe or violate the rights of others,
including rights in copyrighted content.
- Training on primarily instrumental data. With specific prompting, this model
has been observed to generate some vocal sounds and effects, though these
tend to be non-lexical.
### Known limitations
**Coverage of broad musical styles**. Magenta RT's training data primarily
consists of Western instrumental music. As a consequence, Magenta RT has
incomplete coverage of both vocal performance and the broader landscape of rich
musical traditions worldwide. For real-time generation with broader style
coverage, we refer users to our
[Lyria RealTime API](https://g.co/magenta/lyria-realtime).
**Vocals**. While the model is capable of generating non-lexical vocalizations
and humming, it is not conditioned on lyrics and is unlikely to generate actual
words. However, there remains some risk of generating explicit or
culturally-insensitive lyrical content.
**Latency**. Because the Magenta RT LLM operates on two-second chunks, user
inputs for the style prompt may take two or more seconds to influence the
musical output.
**Limited context**. Because the Magenta RT encoder has a maximum audio context
window of ten seconds, the model is unable to directly reference music that has
been output earlier than that. While the context is sufficient to enable the
model to create melodies, rhythms, and chord progressions, the model is not
capable of automatically creating longer-term song structures.
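Concretely, with two-second chunks and a ten-second context window, the model
only ever attends to its five most recent chunks. The sketch below
(illustrative Python, not the actual implementation; `generate_chunk` is a
hypothetical stand-in for the LLM call) shows the kind of rolling context
buffer this implies:
```python
from collections import deque

CHUNK_SECONDS = 2     # each decoder call emits 2s of audio
CONTEXT_SECONDS = 10  # maximum audio context seen by the encoder
MAX_CONTEXT_CHUNKS = CONTEXT_SECONDS // CHUNK_SECONDS  # = 5 chunks

def generate_chunk(context_chunks, style):
    """Hypothetical stand-in for the Magenta RT LLM call."""
    return f"audio conditioned on {len(context_chunks)} prior chunks, style={style!r}"

# Rolling buffer of recent chunks: anything older than 10s falls out of the
# model's view, so longer-term song structure must be managed outside the model.
context = deque(maxlen=MAX_CONTEXT_CHUNKS)
for _ in range(8):
    chunk = generate_chunk(list(context), "heavy metal")
    context.append(chunk)  # oldest chunk is dropped once 5 are buffered
```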
### Benefits
At the time of release, Magenta RealTime represents the only open weights model
supporting real-time, continuous musical audio generation. It is designed
specifically to enable live, interactive musical creation, bringing new
capabilities to musical performances, art installations, video games, and many
other applications.
## How to Get Started with the Model
See our
[Colab demo](https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb)
and [GitHub repository](https://github.com/magenta/magenta-realtime) for usage
examples.
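As a rough orientation before opening the Colab, the real-time loop is
conceptually a repeated call that turns a style prompt and the recent audio
context into the next two-second chunk. The sketch below is an
assumption-laden outline, not the supported API; names such as `MagentaRT`,
`embed_style`, and `generate_chunk` are illustrative, so defer to the Colab
demo and repository for working code:
```python
# Conceptual outline of continuous generation; the names below are illustrative
# assumptions, not a documented API (see the Colab demo / repository).
from magenta_rt import system  # assumed import path

mrt = system.MagentaRT()                         # load SpectroStream, MusicCoCa, and the LLM
style = mrt.embed_style("warm synth arpeggios")  # text prompt -> quantized MusicCoCa embedding

state, chunks = None, []
for _ in range(5):                               # 5 chunks of ~2s each -> ~10s of audio
    chunk, state = mrt.generate_chunk(state=state, style=style)
    chunks.append(chunk)                         # in a live setting, stream each chunk to the audio device
```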
## Training Details
### Training Data
Magenta RealTime was trained on ~190k hours of stock music from multiple
sources, mostly instrumental.
### Hardware
Magenta RealTime was trained using
[Tensor Processing Unit (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu)
hardware (TPUv6e / Trillium).
### Software
Training was done using [JAX](https://github.com/jax-ml/jax) and
[T5X](https://github.com/google-research/t5x), utilizing
[SeqIO](https://github.com/google/seqio) for data pipelines. JAX allows
researchers to take advantage of the latest generation of hardware, including
TPUs, for faster and more efficient training of large models.
## Evaluation
Model evaluation metrics and results will be shared in our forthcoming technical
report.
## Citation
A technical report is forthcoming. For now, please cite our
[blog post](https://g.co/magenta/rt).
**BibTeX:**
```
@article{magenta_rt,
  title={Magenta RealTime},
  url={https://g.co/magenta/rt},
  publisher={Google DeepMind},
  author={Lyria Team},
  year={2025}
}
```