InstanceGen: Image Generation with Instance-level Instructions
Abstract
The proposed technique combines image-based structural guidance with LLM-based instance-level instructions to align generated images with complex, detailed text prompts.
Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle to capture the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, we are witnessing growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by making the observation that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances. Additionally, we contribute CompoundPrompts, a benchmark composed of complex prompts with three difficulty levels, in which object instances are progressively compounded with attribute descriptions and spatial relations. Extensive experiments demonstrate that our method significantly surpasses prior models, particularly on complex multi-object and multi-attribute use cases.
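The page carries no code, but the abstract outlines a three-stage inference-time flow: obtain a fine-grained structural initialization from a pretrained text-to-image model, decompose the prompt into instance-level instructions with an LLM, and refine the image so each instance satisfies its instruction. The Python sketch below is a minimal illustration of that flow under stated assumptions; every function name and the InstanceInstruction schema are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InstanceInstruction:
    """One per-object directive derived from the prompt (hypothetical schema)."""
    object_name: str       # e.g. "cat"
    attributes: List[str]  # e.g. ["orange", "sleeping"]
    relation: str          # e.g. "to the left of the dog"

def generate_initial_image(prompt: str):
    """Stage 1 (assumed): a pretrained text-to-image model supplies a
    plausible fine-grained structural initialization for the scene."""
    raise NotImplementedError("placeholder for any pretrained T2I model")

def derive_instance_instructions(prompt: str) -> List[InstanceInstruction]:
    """Stage 2 (assumed): an LLM decomposes the compound prompt into
    instance-level instructions covering counts, attributes, and relations."""
    raise NotImplementedError("placeholder for an LLM parsing step")

def refine_with_guidance(image, instructions: List[InstanceInstruction]):
    """Stage 3 (assumed): guided re-generation so each instance in the
    initial structure satisfies its instruction."""
    raise NotImplementedError("placeholder for guided refinement")

def instancegen(prompt: str):
    """End-to-end inference-time flow as described in the abstract."""
    init_image = generate_initial_image(prompt)
    instructions = derive_instance_instructions(prompt)
    return refine_with_guidance(init_image, instructions)
```

One design point the abstract implies: decoupling the stages lets the image model supply layout while the LLM supplies per-instance semantics, so no retraining is needed.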
Community
We introduce InstanceGen, an inference-time technique that improves diffusion models' ability to generate images for complex prompts involving multiple objects, instance-level attributes, and spatial relationships.
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- POEM: Precise Object-level Editing via MLLM control (2025)
- LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration (2025)
- HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation (2025)
- Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing (2025)
- Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models (2025)
- Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers (2025)
- Improving Editability in Image Generation with Layer-wise Memory (2025)