Abstract
A unified image editing framework, IEAP, built on Diffusion Transformer (DiT) decomposes complex editing instructions into operations performed by vision-language models for robust editing across various tasks.
While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To mitigate this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Codes are available at https://github.com/YujiaHu1109/IEAP.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ImgEdit: A Unified Image Editing Dataset and Benchmark (2025)
- MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection (2025)
- SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing (2025)
- DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing (2025)
- In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer (2025)
- Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions (2025)
- RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend