RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning
Abstract
RePrompt, a reprompting framework using reinforcement learning, enhances text-to-image generation by optimizing for image-level outcomes, significantly improving spatial layout and compositional generalization.
Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt-enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. Tailored reward models assess the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-CompBench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.
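To make the training signal concrete, the sketch below shows one way such a loop could be wired up. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a REINFORCE-style policy gradient (the abstract does not name the RL algorithm), and `generate_image`, `image_reward`, and the Qwen checkpoint are hypothetical stand-ins for the frozen T2I backbone and the tailored reward models described above.

```python
# Minimal RePrompt-style training loop sketch. Assumptions: REINFORCE-style
# policy gradient; the T2I backbone and reward scorers below are hypothetical
# placeholders, not the paper's components.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # assumption: any instruct LM works here
tok = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name).to(device)
opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def generate_image(prompt: str):
    """Hypothetical hook into a frozen T2I backbone (e.g., a diffusion model)."""
    raise NotImplementedError

def image_reward(image, user_prompt: str) -> float:
    """Hypothetical composite score over the generated image: human preference
    + semantic alignment + visual composition, per the abstract."""
    raise NotImplementedError

def train_step(user_prompt: str):
    # 1. The policy reasons about the prompt, then emits an enhanced rewrite.
    msgs = [{"role": "user",
             "content": f"Reason step by step, then rewrite this T2I prompt: {user_prompt}"}]
    inputs = tok.apply_chat_template(
        msgs, add_generation_prompt=True, return_tensors="pt").to(device)
    out = policy.generate(inputs, max_new_tokens=256, do_sample=True)
    enhanced = tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)

    # 2. A frozen T2I model renders the enhanced prompt; the reward is computed
    #    on the image, so supervision of the prompt is indirect (no annotations).
    reward = image_reward(generate_image(enhanced), user_prompt)

    # 3. REINFORCE: scale the NLL of the sampled completion by the reward, so
    #    high-reward prompts become more likely. (In practice a baseline is
    #    usually subtracted from the reward to reduce variance.)
    labels = out.clone()
    labels[:, :inputs.shape[1]] = -100  # score only the generated tokens
    nll = policy(out, labels=labels).loss  # mean negative log-likelihood
    loss = reward * nll
    opt.zero_grad()
    loss.backward()
    opt.step()
    return enhanced, reward
```

Note the design choice this sketch mirrors: gradients flow only through the prompt-generating policy, while the T2I model and reward scorers stay frozen, which is what makes the image-level supervision indirect and annotation-free.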
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning (2025)
- Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation (2025)
- ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models (2025)
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT (2025)
- Improving Physical Object State Representation in Text-to-Image Generative Systems (2025)
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning (2025)
- VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank (2025)