arxiv:2405.09818

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Published on May 16 · Submitted by akhaliq on May 17 · #1 Paper of the day

Abstract

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.
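
For readers new to the setup, "early fusion" here means that images are quantized into discrete tokens and placed in the same sequence as text tokens, so a single transformer models both modalities. Below is a minimal sketch of that interleaving, assuming a text tokenizer, an image tokenizer that emits a fixed number of codebook ids per image, and invented sentinel/offset ids; none of these names or constants come from the released code.

```python
# Illustrative sketch of early-fusion token interleaving (not the authors' code).
# Assumed interfaces: text_tok.encode(str) -> List[int], image_tok.encode(img) -> List[int]
# (discrete codebook ids); the sentinel and offset ids below are invented for the example.
from typing import Any, List

BOI, EOI = 65533, 65534        # hypothetical "begin/end of image" sentinel ids
IMAGE_ID_OFFSET = 32000        # hypothetical shift so image ids don't collide with text ids

def interleave(segments: List[Any], text_tok, image_tok) -> List[int]:
    """Flatten a mixed list of text strings and images into one token sequence."""
    seq: List[int] = []
    for seg in segments:
        if isinstance(seg, str):
            seq.extend(text_tok.encode(seg))                 # ordinary text ids
        else:
            codes = image_tok.encode(seg)                    # fixed-length list of codebook ids
            seq.append(BOI)
            seq.extend(IMAGE_ID_OFFSET + c for c in codes)   # image ids in a reserved range
            seq.append(EOI)
    return seq  # one autoregressive transformer is trained on sequences like this
```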

Community

Nice paper! Tiny nit: it sounds like there is supposed to be a comparison to LLaVA-1.5, but it is missing from the image-to-text results table.


The purpose is completely different from LLaVA's.

Will there be model/code release?


Maybe eventually; it seems to just be a paper right now.

Read it, good training strategies. Thanks

Thanks

Great work! I like the discussion around training stability!

I had a few questions:

a/

"We narrowed down the cause of the divergence to the softmax operation being problematic when training with multiple modalities of significantly varying entropy due to the translation invariant property of softmax (i.e., softmax(z) = softmax(z + c)). Because we share all weights of the model across modalities, each modality will try to “compete” with the other by increasing its norms slightly"

Can you expand on this explanation?

b/ In Figure 6b, does "Training loss curve with image generation disabled does not suffer from instability issues" mean that the data is pure text only, or that you do not compute the loss (and thus gradients) on the image tokens?

c/ one of the long-lasting question for these types of multimodal models is whether they are more sample efficient (transfer between modalities) or learn something they were not able to learn just from observing pure text. do you have any insights into that question with the chamaleon models?

The Future of AI: Chameleon’s Breakthrough in Multimodal Models

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

After reading the paper, I cannot find how you decode from the codebook embeddings back into an image. Is there a decoder trained jointly with the transformer architecture?
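
On the decoding question: as I read the paper, the image tokenizer is a separately trained VQ-style encoder/decoder (in the spirit of VQ-VAE/VQGAN), not something learned jointly with the transformer. The transformer emits discrete image-token ids, which are looked up in the tokenizer's codebook and passed through its convolutional decoder to produce pixels. A rough sketch of that last step, with every shape and layer here invented for illustration:

```python
import torch
import torch.nn as nn

class VQImageDecoder(nn.Module):
    """Illustrative VQ-style decoder: ids -> codebook embeddings -> pixels.

    All names and shapes are assumptions for the sketch (e.g. an 8192-entry codebook and
    a 32x32 latent grid); the real tokenizer/decoder come from a separately trained model.
    """
    def __init__(self, codebook_size=8192, dim=256, latent_hw=32):
        super().__init__()
        self.latent_hw = latent_hw
        self.codebook = nn.Embedding(codebook_size, dim)      # learned during tokenizer training
        self.to_pixels = nn.Sequential(                       # stand-in for the real CNN decoder
            nn.ConvTranspose2d(dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, latent_hw * latent_hw) discrete ids emitted by the transformer
        b = token_ids.size(0)
        z = self.codebook(token_ids)                          # (B, HW, dim) embedding lookup
        z = z.transpose(1, 2).reshape(b, -1, self.latent_hw, self.latent_hw)
        return self.to_pixels(z)                              # (B, 3, H, W) image in [-1, 1]

if __name__ == "__main__":
    ids = torch.randint(0, 8192, (1, 32 * 32))    # pretend the transformer emitted these
    print(VQImageDecoder()(ids).shape)            # torch.Size([1, 3, 256, 256])
```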


Models citing this paper 3

Datasets citing this paper 0


Spaces citing this paper 6

Collections including this paper 31