VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
Abstract
Recent progress in diffusion models has significantly advanced various image generation tasks. However, the mainstream approach is still to build task-specific models, which are inefficient when a wide range of different needs must be supported. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instructions, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework that supports a wide range of in-domain tasks, generalization to unseen tasks, unification of multiple tasks in a single step (unseen during training), and reverse generation. Unlike existing methods that rely on language-based task instructions, which lead to task ambiguity and weak generalization, we integrate visual in-context learning, allowing the model to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying their architectures.
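To make the link between in-context generation and image infilling concrete, the sketch below lays out demonstration pairs and a query as a single grid whose unknown cell is then infilled. This is a minimal illustration of the formulation described above, not the released implementation; the two-column layout, the 512×512 cell size, and the helper name `compose_incontext_grid` are assumptions for illustration only.

```python
from PIL import Image, ImageDraw

def compose_incontext_grid(examples, query_condition, cell_size=(512, 512)):
    """Arrange in-context examples and a query into one grid image.

    examples:        list of (condition_image, target_image) PIL pairs that
                     demonstrate the task.
    query_condition: PIL image of the new condition whose target is unknown.

    Returns (grid, mask): the composed grid and a binary mask marking the
    cell that an infilling model should generate.
    """
    w, h = cell_size
    rows = len(examples) + 1          # demonstration rows + one query row
    cols = 2                          # condition column, target column

    grid = Image.new("RGB", (cols * w, rows * h), "white")
    mask = Image.new("L", (cols * w, rows * h), 0)

    # Demonstration rows: both condition and target are fully visible.
    for r, (cond, tgt) in enumerate(examples):
        grid.paste(cond.resize(cell_size), (0, r * h))
        grid.paste(tgt.resize(cell_size), (w, r * h))

    # Query row: the condition is visible, while the target cell is masked,
    # so generating the target reduces to infilling the masked region.
    grid.paste(query_condition.resize(cell_size), (0, (rows - 1) * h))
    ImageDraw.Draw(mask).rectangle(
        [w, (rows - 1) * h, cols * w, rows * h], fill=255
    )
    return grid, mask
```

The resulting `(grid, mask)` pair would then be handed to a pre-trained image-infilling model, which is why strong generative priors of such models can be reused without architectural changes.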
Community
Project page: https://visualcloze.github.io/
[Paper] | [Online Demo] | [Project Page]
[🤗 Model Card] | [🤗 Graph200K Dataset Card]
An in-context-learning-based universal image generation framework that uses in-context examples as task demonstrations to guide the model in understanding and executing tasks.
- Support various in-domain tasks.
- Generalize to unseen tasks through in-context learning.
- Unify multiple tasks into one step and generate both the target image and intermediate results (unseen during training).
- Support reverse generation, i.e., reverse-engineering a set of conditions from a target image (unseen during training); see the sketch below.
🔥 Examples are available on the project page.
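Reverse generation and multi-task unification follow from the same grid view used in the earlier sketch: whichever cells are masked are the ones the infilling model generates. The hypothetical helper below reuses the assumed layout and simply swaps which cell of the query row is masked; names and layout are again illustrative assumptions, not the project's API.

```python
from PIL import Image, ImageDraw

def compose_reverse_query(examples, target_image, cell_size=(512, 512)):
    """Grid layout for reverse generation (hypothetical helper): the query
    row shows the known target image and masks the condition cell, asking
    the infilling model to recover the condition instead."""
    w, h = cell_size
    rows = len(examples) + 1                      # demonstrations + query
    grid = Image.new("RGB", (2 * w, rows * h), "white")
    mask = Image.new("L", (2 * w, rows * h), 0)

    # Demonstration rows: (condition, target) pairs shown in full.
    for r, (cond, tgt) in enumerate(examples):
        grid.paste(cond.resize(cell_size), (0, r * h))
        grid.paste(tgt.resize(cell_size), (w, r * h))

    # Query row: the target is given on the right; the condition cell on
    # the left is masked and left for the model to reconstruct.
    grid.paste(target_image.resize(cell_size), (w, (rows - 1) * h))
    ImageDraw.Draw(mask).rectangle([0, (rows - 1) * h, w, rows * h], fill=255)
    return grid, mask
```

Masking several cells at once would, in the same spirit, ask the model to produce a target together with its intermediate results in a single pass.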
The following similar papers were recommended by the Semantic Scholar API (automated message from the Librarian Bot):
- RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models (2025)
- MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing (2025)
- Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision (2025)
- UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing (2025)
- UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer (2025)
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models (2025)
- DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks (2025)