Unified Models for Image Understanding and Generation: Understanding Cutting-Edge Model Architectures

Community Article · Published September 15, 2025

Generative multimodal models have been a major research focus in recent years. Vision Language Models (VLMs) are the core approach to multimodal text generation and handle image understanding tasks, while Diffusion Models are the core method for image and video generation. Since earlier this year, unified models that support both image understanding and generation have appeared in rapid succession. They are favored not only for their versatility in covering both understanding and generation, but also because researchers see the potential for multimodal learning that comes from combining the two tasks organically. On one hand, combining both tasks lets a model be jointly optimized on them, making better use of interleaved image-text data and showing the research community the potential for the two tasks to reinforce each other. On the other hand, supporting multimodal output opens up new possibilities for the currently popular reasoning paradigm, such as reasoning over generated images and generating images guided by reasoning.

Starting from the two tasks: before unified models, understanding was handled by Vision Language Models following the autoregressive (AR) route, while generation was handled by Diffusion Models following DDPM or Flow Matching routes. The technology behind unified models therefore also builds on these two routes. This article categorizes current typical unified models into four types: the pure autoregressive route, AR+Diffusion serial structures, AR+Diffusion parallel structures, and a single model doing AR and Diffusion simultaneously. Since the technology and ecosystem for Diffusion-based text understanding are not yet mature, this article does not cover unified models based solely on Diffusion.

Note that this article is not a complete survey and may not cover every paper. The four categories are designed to better capture the differences between models. The fastest way to understand a unified model is to clarify how it handles image understanding and generation. For image understanding, current unified models generally follow the VLM approach, encoding images into embeddings and processing them jointly with an LLM; this route has been extensively validated, and the main difference between routes lies in how images are encoded. For image generation, the differences between models mainly lie in three aspects: how images are encoded, how the image encodings are generated, and how images are decoded. For each route below, we focus on these key differences.

Pure Autoregressive Route Unified Models

Autoregression means predicting the next token based on the input sequence and feeding the predicted token back into the input for recursive prediction. Pure autoregressive route unified models can be seen as a combination of text token prediction in LLMs and image token prediction from VQGAN[1]. Typical works include LWM[2], Chameleon[3], Emu3[4], Janus[5], and Janus-Pro[6]. LWM and Chameleon are relatively early works on unified training of text and images, Emu3 further extends to video generation modality, while Janus and Janus-Pro separate the encoding for image understanding. Here we analyze models from this route using Chameleon and Janus.

Chameleon's model architecture is shown in the figure below. For image understanding tasks, it first uses the VQ-VAE encoder as an image tokenizer to convert images into discrete tokens, then uses an autoregressive model to predict the text output. For image generation tasks, it uses the VQ-VAE decoder as an image de-tokenizer to decode the discrete image tokens predicted by the autoregressive model back into images.

*(Figure: Chameleon architecture)*
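To make the pipeline concrete, here is a minimal sketch of pure-AR image generation over a joint text+image vocabulary. It assumes a pretrained VQ de-tokenizer and a decoder-only model that exposes plain logits; the names, vocabulary sizes, and the 32x32 code grid are illustrative assumptions, not Chameleon's actual interface.

```python
# Minimal sketch of a Chameleon-style pure-AR pipeline (names/sizes illustrative).
# Joint vocabulary: ids in [0, TEXT_VOCAB) are text, the rest are VQ image codes.
import torch

TEXT_VOCAB, IMG_TOKENS = 32_000, 1024   # assume a 32x32 grid of latent codes

@torch.no_grad()
def generate_image(ar_model, vq_decoder, prompt_ids):
    """Autoregressively sample IMG_TOKENS discrete image codes, then de-tokenize."""
    seq = prompt_ids                                  # (1, T) text prompt ids
    for _ in range(IMG_TOKENS):
        logits = ar_model(seq)[:, -1, :]              # logits for the next token
        logits[:, :TEXT_VOCAB] = float("-inf")        # restrict sampling to image codes
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        seq = torch.cat([seq, next_tok], dim=1)       # feed the prediction back in
    codes = seq[:, -IMG_TOKENS:] - TEXT_VOCAB         # shift back to codebook indices
    return vq_decoder(codes.view(1, 32, 32))          # VQ de-tokenizer -> RGB image
```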

Building on this architecture, the Janus team argues that the VQ-VAE encoder, trained on a reconstruction objective, is poorly suited to semantic-level image understanding. They therefore replace the image tokenizer for understanding tasks with SigLIP, which is trained on image-text pairs. The architecture is shown below: the Und. Encoder (understanding) uses SigLIP, while the Gen. Encoder (generation) and the Image Decoder use the VQ-VAE.

*(Figure: Janus architecture with decoupled visual encoding)*
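The decoupling can be summarized in a few lines. The sketch below routes understanding inputs through a SigLIP-style encoder (continuous features projected to the LLM width) and generation targets through a VQ encoder (discrete codes used as prediction labels); the module names are stand-ins, not Janus's real API.

```python
# Sketch of Janus-style decoupled visual encoding (module names are stand-ins).

def encode_image_for_llm(task, image, siglip, und_proj, vq_encoder, gen_embed):
    """Return the image representation fed to the LLM, depending on the task."""
    if task == "understanding":
        feats = siglip(image)            # (1, N, d_siglip) continuous semantic features
        return und_proj(feats)           # adaptor/MLP projects into the LLM hidden size
    if task == "generation":
        codes = vq_encoder(image)        # (1, M) discrete codebook indices of the target
        return gen_embed(codes)          # generation-side embedding lookup; the codes
                                         # also serve as next-token-prediction labels
    raise ValueError(f"unknown task: {task}")
```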

Simply put, pure autoregressive unified models fold VQ-VAE's discrete image tokens into the unified training of an LLM or VLM. The advantage is that discrete image token prediction aligns closely with the LLM pretraining paradigm and fits the characteristics of AR models very well. From an image quality standpoint, however, models on this route produce unsatisfactory results, partly because discretizing the image encoding space loses information, and partly because autoregressive models cannot model distributions the way diffusion models do. In addition, since random noise cannot be injected, poor diversity of generated images is another major challenge for this route.

AR+Diffusion Serial Structure Unified Models

AR and Diffusion are the mainstream routes for image understanding and generation respectively, so perhaps the simplest and most effective unified-model route is to connect them in series: the AR model handles understanding tasks, and its output serves as the condition for a Diffusion model that handles generation tasks, as shown in the architecture below (source: the unified model survey[7]).

*(Figure: AR+Diffusion serial structure, from the unified model survey[7])*

For understanding tasks, images are encoded by semantic encoders (usually CLIP, SigLIP, or ViT trained with image-text alignment) into continuous embeddings. For generation (text-to-image) tasks, the model processes text input and outputs an intermediate embedding as a condition for the Diffusion model to generate images. Here, the image encoding for generation tasks is the intermediate embedding between AR and Diffusion models, directly produced by the AR model. Based on whether intermediate embeddings are explicitly supervised, we categorize several typical works into two types.

2.1 Unified Modeling Methods Using Semantic Embedding Supervision

As the name suggests, these methods use a loss function to directly supervise the image embeddings output by the AR model, giving it a clear embedding target, and then use these embeddings to train a Diffusion model to reconstruct the image. Typical methods include MetaMorph[8], Nexus-Gen[9], and BLIP3-o[10]. Their typical architecture is shown below:

*(Figure: typical architecture of semantic-embedding-supervised methods)*

As shown in figure (b) above, MetaMorph and Nexus-Gen use an image loss (usually MSE or cosine similarity) to supervise the AR model so that it learns to predict the semantic embeddings of target images. One motivation is to answer, from a joint-training perspective, why unified models are needed: in MetaMorph's experiments, training image understanding and generation together from scratch lets the two tasks improve each other. Nexus-Gen, on the other hand, uses a unified image embedding space to model understanding and generation as inverse tasks, with the potential benefit that generated embeddings can be understood directly, opening the door to multi-round reasoning. This sub-route also has a known issue: supervising an autoregressive model to predict continuous image embeddings leads to serious error accumulation. MetaMorph does not address this phenomenon, while Nexus-Gen adopts a prefilled autoregression strategy to resolve it, which is essentially consistent with the Learnable Query used in other works (BLIP3-o, MetaQuery[11]).
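As a rough illustration of this supervision, the sketch below computes a combined MSE + cosine loss between AR-predicted image embeddings and the targets from a frozen semantic encoder. The loss mix, weighting, and shapes are assumptions for illustration, not the precise recipe of MetaMorph or Nexus-Gen.

```python
import torch
import torch.nn.functional as F

def embedding_supervision_loss(pred_emb, target_image, siglip, w_cos=1.0):
    """Supervise AR-predicted image embeddings with frozen semantic targets.
    pred_emb: (B, N, D) embeddings predicted by the AR model at image positions;
    siglip:   frozen semantic encoder mapping images to (B, N, D) targets."""
    with torch.no_grad():
        target_emb = siglip(target_image)
    mse = F.mse_loss(pred_emb, target_emb)
    cos = 1.0 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()
    return mse + w_cos * cos

# The predicted/target embeddings are then used as the condition for a diffusion
# decoder that is trained to reconstruct the image pixels from the embedding.
```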

BLIP3-o follows a similar embedding-supervision idea, but additionally applies Flow Matching to the semantic embeddings for distribution modeling, compensating for the autoregressive model's inability to model distributions within this architecture, as shown below:

*(Figure: BLIP3-o architecture)*
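The following is a generic rectified-flow-style sketch of flow matching over semantic embeddings, with a small velocity network conditioned on the AR model's output. It conveys the idea rather than BLIP3-o's exact implementation; `velocity_net` and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, cond, target_emb):
    """Rectified-flow-style flow matching over semantic embeddings.
    cond: conditioning produced by the AR model; target_emb: (B, N, D) targets."""
    noise = torch.randn_like(target_emb)                       # x_0 ~ N(0, I)
    t = torch.rand(target_emb.size(0), 1, 1, device=target_emb.device)
    x_t = (1 - t) * noise + t * target_emb                     # point on the linear path
    v_target = target_emb - noise                              # constant target velocity
    v_pred = velocity_net(x_t, t.flatten(), cond)              # predict the velocity
    return F.mse_loss(v_pred, v_target)
```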

2.2 Methods Directly Training Diffusion Models

These methods generally freeze the AR model and directly use its output hidden states as conditions for the Diffusion model, training only the Diffusion part for image generation. From another perspective, methods on this route can be seen as an evolution of Diffusion technology, replacing the T5 text encoders commonly used in Diffusion models with larger multimodal generative models (e.g., Qwen2.5-VL-7B). Typical methods include UniWorld[12], MetaQuery (note: MetaQuery uses Learnable Queries rather than hidden states for condition extraction), Qwen-Image[13], and OmniGen2[14]. For image generation tasks, their typical architecture is shown below (source: Qwen-Image): the text prompt is fed into Qwen2.5-VL, and the hidden states corresponding to the prompt tokens are used directly as the text condition for the downstream Diffusion Transformer.

*(Figure: Qwen-Image text-to-image architecture)*
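A minimal training-step sketch for this sub-route, assuming a frozen VLM whose prompt-token hidden states condition a trainable DiT under a flow-matching objective. The function and module names (`frozen_vlm`, `dit`, `vae`) are placeholders, not the actual Qwen-Image or UniWorld code.

```python
import torch
import torch.nn.functional as F

def train_step(frozen_vlm, dit, vae, optimizer, images, prompt_ids):
    """One step: a frozen VLM's hidden states condition a trainable DiT."""
    with torch.no_grad():
        cond = frozen_vlm(prompt_ids)                # (B, T, d) prompt-token hidden states
        latents = vae.encode(images)                 # (B, C, h, w) image latents
    noise = torch.randn_like(latents)
    t = torch.rand(latents.size(0), device=latents.device)
    tb = t.view(-1, 1, 1, 1)
    x_t = (1 - tb) * noise + tb * latents            # noised latents on the flow path
    v_pred = dit(x_t, t, cond)                       # DiT attends to the VLM hidden states
    loss = F.mse_loss(v_pred, latents - noise)       # flow-matching velocity target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```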

Beyond image generation, part of the appeal of unified models lies in their image editing capabilities, so we additionally analyze the image editing architectures of these models. Compared with image generation, using a Diffusion model for image editing adds one input condition: an encoding of the image to be edited. Two types of encoding can be used for that image: semantic encoding and reconstruction encoding.

  1. Semantic Encoding Architecture: Taking UniWorld as an example, the architecture using semantic encoding is shown below; focus on the SigLIP part. The image to be edited passes directly through SigLIP and an MLP before being injected into the DiT as a condition. Nexus-Gen also supports image editing with a similar semantic-feature condition-injection architecture. In Nexus-Gen's image editing experiments, compared with VAE encoding, semantic feature encoding lies in a space closer to the VLM's output semantic space, so only a small amount of training data is needed to relate the text and image conditions, with the potential advantage of better instruction following. The downside is that semantic encoding is lossy: reconstruction quality correlates strongly with the number of encoding tokens, and one-to-one reconstruction is often impossible. Judging from its editing reconstruction quality, GPT4o-Image likely also uses semantic encoding.

*(Figure: UniWorld semantic-encoding editing architecture)*

  2. VAE Encoding Architecture: Qwen-Image and OmniGen2 use this architecture, as do the earlier open-sourced Step1X-Edit[15] and Flux-Kontext[16]; going back further, it is consistent with the approaches of In-Context LoRA and OmniControl. Taking Qwen-Image as an example, the architecture is shown below; focus on the Input Image, which passes through the VAE Encoder and is fed into the DiT as a condition. In this architecture, positional encoding is generally used to distinguish the input image from the image being denoised; for example, Qwen-Image and Flux-Kontext distinguish them directly in the first dimension (frame id) of the positional encoding, as sketched after this list.

*(Figure: Qwen-Image VAE-encoding editing architecture)*
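Below is a schematic of the VAE-conditioning idea from item 2: the reference image's latents are appended to the noised target latents as extra tokens, and the two are told apart by the first (frame id) dimension of a 3D positional index. This sketches the mechanism under simplified assumptions; it is not the exact RoPE layout of Qwen-Image or Flux-Kontext.

```python
import torch

def build_edit_sequence(noised_latents, ref_latents):
    """Concatenate noised target latents with VAE latents of the image to edit,
    distinguishing the two via the frame-id (first) positional dimension.
    Both inputs: (B, C, H, W); returns tokens (B, 2*H*W, C) and positions (B, 2*H*W, 3)."""
    B, C, H, W = noised_latents.shape

    def to_tokens(lat, frame_id):
        tokens = lat.flatten(2).transpose(1, 2)                    # (B, H*W, C)
        ys, xs = torch.meshgrid(torch.arange(H, device=lat.device),
                                torch.arange(W, device=lat.device), indexing="ij")
        pos = torch.stack([torch.full_like(ys, frame_id), ys, xs], dim=-1)
        return tokens, pos.reshape(-1, 3).expand(B, -1, -1)        # (B, H*W, 3)

    tgt_tok, tgt_pos = to_tokens(noised_latents, frame_id=0)       # image being denoised
    ref_tok, ref_pos = to_tokens(ref_latents, frame_id=1)          # image to be edited
    tokens = torch.cat([tgt_tok, ref_tok], dim=1)                  # joint attention in the DiT
    positions = torch.cat([tgt_pos, ref_pos], dim=1)               # (frame, y, x) RoPE indices
    return tokens, positions
```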

AR+Diffusion Parallel Structure

The serial structure above uses embeddings as the bridge between AR and Diffusion, whereas the parallel structure discussed here uses attention as the bridge between the AR and Diffusion models. Typical works include LlamaFusion[17] and Bagel[18]. Following Bagel's terminology, this architecture can be called Mixture-of-Transformer-Experts (MoT).

3.1 LlamaFusion Freezes Text Model

LlamaFusion's architecture is shown below. Given a language model (left figure), the authors copy its parameters to serve as image-generation-specific parameters (right figure). For a sequence containing text and noised image latents, text tokens are computed with the left-figure parameters and image tokens with the right-figure parameters, but during attention all tokens are concatenated for joint self-attention. Since the language model is frozen, this architecture cannot change the model's understanding capability, so it does not touch the encoding question for image understanding. Image generation uses a VAE for both encoding and decoding; although the model retains the language-model structure, image generation actually follows the Diffusion route.

*(Figure: LlamaFusion architecture)*
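The MoT idea shared by LlamaFusion and Bagel can be sketched as a modality-routed transformer block: all tokens share one self-attention pass, while text and image tokens go through separate feed-forward experts. For brevity only the FFN is routed here, whereas the real models also duplicate the attention projections (and LlamaFusion keeps the text-side weights frozen); this is a simplified illustration, not either model's actual code.

```python
import torch
import torch.nn as nn

class ModalityRoutedBlock(nn.Module):
    """MoT sketch: one shared self-attention pass, separate FFN experts per modality."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_text = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                      nn.Linear(4 * d_model, d_model))
        self.ffn_image = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                       nn.Linear(4 * d_model, d_model))

    def forward(self, x, is_image):
        """x: (B, T, d); is_image: (B, T) bool mask marking image (noised-latent) tokens."""
        attn_out, _ = self.attn(x, x, x)          # text and image tokens attend jointly
        h = x + attn_out
        # Route each token to its modality expert (both experts are evaluated and then
        # selected; fine for a sketch, a real implementation would gather/scatter).
        routed = torch.where(is_image.unsqueeze(-1), self.ffn_image(h), self.ffn_text(h))
        return h + routed
```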

3.2 Bagel Mixed Image-Text Training

Bagel uses an architecture similar to LlamaFusion, but differs in that both image understanding and generation capabilities are trained from scratch. Image understanding uses a semantic encoder such as SigLIP, while image generation uses a reconstruction model (VAE) for encoding and decoding. Understanding tasks autoregressively generate text tokens (AR), while generation tasks use diffusion to generate the VAE features of images.

*(Figure: Bagel Mixture-of-Transformer-Experts architecture)*

Strictly speaking, Bagel is the first unified model to conduct ultra-large-scale pretraining. Its uniqueness lies in truly performing base-model-level training on mixed-modal data (Chameleon being its predecessor in this respect), and the paper also reports emerging capabilities under this setting.

Single Model Simultaneously Doing AR+Diffusion

Unlike the AR+Diffusion routes above, AR and Diffusion here refer to the loss functions. The idea of this route is to use a single Transformer for both sequence modeling and distribution modeling: within the same sequence, text tokens are trained with the AR next-token-prediction (NTP) loss for sequence modeling, while image tokens are trained with a diffusion loss to learn the image distribution. Typical methods include Transfusion[19], Show-O[20], and Show-O2[21].

Taking Transfusion as an example, its architecture is shown below. A 7B Transformer performs unified sequence and distribution modeling: sequence modeling for text tokens and distribution modeling for image tokens. Both image understanding and generation use VAE features as the image encoding. Show-O and Show-O2 use similar architectures, except that they attach only a lightweight flow head for image denoising in the diffusion part; we will not analyze them further here.

*(Figure: Transfusion architecture)*
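A condensed sketch of the mixed objective: the same transformer hidden states are read by an LM head at text positions (NTP cross-entropy) and by a regression head at image-latent positions (diffusion-style MSE). The head names, masks, and diffusion parameterization are illustrative assumptions rather than Transfusion's exact implementation.

```python
import torch
import torch.nn.functional as F

def mixed_loss(hidden, lm_head, diff_head, text_labels, text_mask,
               image_mask, noise_target):
    """hidden: (B, T, d) transformer outputs over an interleaved text+image sequence.
    text_mask / image_mask: (B, T) booleans; text_labels: (B, T) next-token ids;
    noise_target: (B, T, d_latent) diffusion regression targets at image positions."""
    # Next-token-prediction (AR) loss on text positions.
    logits = lm_head(hidden[text_mask])                       # (N_text, vocab)
    ntp = F.cross_entropy(logits, text_labels[text_mask])
    # Diffusion (e.g. noise / velocity prediction) loss on image-latent positions.
    pred = diff_head(hidden[image_mask])                      # (N_img, d_latent)
    diff = F.mse_loss(pred, noise_target[image_mask])
    return ntp + diff
```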

Summary

Summarizing the work from the above routes, some conclusions that can currently be drawn are:

  1. For image understanding tasks, image tokenizers should use semantic encoders like SigLIP. For image generation tasks, VAE provides better detail reconstruction.

  2. At least one stage in the image generation process should do image distribution modeling to ensure better image generation quality.

Given these two conclusions, the pure AR route alone is insufficient for unified models. This is why, after Janus-Pro, few models following exactly the same architecture have been open-sourced. More recent works along similar lines, such as Show-O2, X-Omni[22], and NextStep[23], all use at least a lightweight flow head, or a fairly large Diffusion Transformer, for image generation.

Among the remaining technical routes, the AR+Diffusion serial route currently appears to be the most stable and the easiest to get good results with. In practice, training data is the core of model quality ("data is all you need"): the models that gain wide recognition and adoption are those with no obvious architectural flaws and with well-prepared training data. When training data differs, it is unrealistic to judge architectural strengths and weaknesses from benchmark numbers, especially since existing evaluation metrics are biased measures of image generation quality. For now, therefore, there is probably no clear conclusion proving that certain architectures are better than others.

Current unified models still need to answer a core question: does the potential for mutual promotion between tasks really exist? Can understanding and generation capabilities improve each other, and can unifying them achieve a 1+1>2 effect? Without engaging these questions, building unified models merely for the sake of building them makes it hard to train genuinely useful models. A cautionary example: many works use a frozen Qwen2.5-VL-7B as the image understanding base model, yet report quite different understanding scores. Whether 1+1 can exceed 2 will need to be verified by works that undergo large-scale training; Bagel verifying the effectiveness of large-scale pretraining on interleaved image-text data, and Qwen-Image's release demonstrating that better text encoding brings significant gains to generation and editing, are a good start.

Despite these open issues, the unified model direction remains one that both academia and industry will follow closely. Its unified narrative fits well with the broader vision of AGI, and the gains that unification brings to generation quality have already been demonstrated by cutting-edge base models like Qwen-Image. Let's keep following the development of unified models and witness the evolution of understanding and generation!

References:

[1] Taming Transformers for High-Resolution Image Synthesis

[2] World Model on Million-Length Video And Language With Blockwise Ring Attention

[3] Chameleon: Mixed-Modal Early-Fusion Foundation Models

[4] Emu3: Next-Token Prediction is All You Need

[5] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

[6] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

[7] Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

[8] MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

[9] Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space

[10] BLIP3-o: A Family of Fully Open Unified Multimodal Models -- Architecture, Training and Dataset

[11] Transfer between Modalities with Meta Queries

[12] UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

[13] Qwen-Image Technical Report

[14] OmniGen2: Exploration to Advanced Multimodal Generation

[15] Step1X-Edit: A Practical Framework for General Image Editing

[16] FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

[17] LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

[18] Emerging Properties in Unified Multimodal Pretraining

[19] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

[20] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

[21] Show-o2: Improved Native Unified Multimodal Models

[22] X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again

[23] NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
