Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
Abstract
Refinement via Regeneration (RvR) improves image refinement in unified multimodal models by reformulating it as conditional image regeneration rather than editing, yielding more complete semantic alignment and higher benchmark scores.
Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially raising the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, in which the UMM produces editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment within a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, which raises GenEval from 0.78 to 0.91, DPG-Bench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.
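To make the paradigm concrete, here is a minimal sketch of the RvR loop. The `UnifiedModel` interface and every name in it (`generate`, `encode_semantic_tokens`, `is_aligned`) are hypothetical stand-ins for a UMM's generation and understanding heads, not the authors' released code.

```python
# Minimal sketch of the RvR loop described in the abstract. All class and
# method names are hypothetical illustrations of the paradigm.
from dataclasses import dataclass


@dataclass
class Image:
    payload: bytes  # placeholder for pixel data


class UnifiedModel:
    """Toy stand-in for a unified multimodal model (UMM)."""

    def generate(self, prompt: str, semantic_tokens: list[int] | None = None) -> Image:
        # RvR path: when `semantic_tokens` is given, regeneration is
        # conditioned on the target prompt plus the semantic tokens of
        # the previous image (no pixel-level preservation constraint).
        return Image(payload=b"")

    def encode_semantic_tokens(self, image: Image) -> list[int]:
        # Encode the image into semantic (not pixel-level) tokens.
        return [0]

    def is_aligned(self, prompt: str, image: Image) -> bool:
        # Self-check with the model's understanding head: does the
        # image satisfy every element of the prompt?
        return True


def refine_via_regeneration(model: UnifiedModel, prompt: str, max_rounds: int = 3) -> Image:
    image = model.generate(prompt)  # initial T2I generation
    for _ in range(max_rounds):
        if model.is_aligned(prompt, image):
            break
        # Unlike refinement-via-editing, no natural-language edit
        # instruction is produced; the image is regenerated conditioned
        # on the prompt and its own semantic tokens, which enlarges the
        # effective modification space.
        tokens = model.encode_semantic_tokens(image)
        image = model.generate(prompt, semantic_tokens=tokens)
    return image
```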
Community
🔥 Regeneration over editing: unlocking more effective image refinement!
We present Refinement via Regeneration (RvR), a novel framework that reformulates image refinement in unified multimodal models from an editing-based paradigm to a regeneration-based one. Instead of relying on intermediate editing instructions and enforcing pixel-level consistency, our method directly regenerates images conditioned on the target prompt and semantic representations of the initial image, thereby enlarging the effective modification space. This design enables more complete semantic alignment and avoids error accumulation from coarse instructions, leading to more flexible and accurate refinement.
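As a companion sketch, the snippet below shows one way the regeneration condition could be assembled for an autoregressive UMM: text tokens of the target prompt concatenated with semantic tokens of the initial image. The special tokens and all token IDs are assumptions for illustration; the point is that the initial image enters the condition only through semantic tokens, so no pixel-level consistency constraint is imposed.

```python
# Hypothetical assembly of an RvR conditioning sequence for an
# autoregressive UMM. Special-token IDs are invented for illustration.
BOS, SEP, BOI = 1, 2, 3  # begin-of-sequence, separator, begin-of-image

def build_rvr_condition(prompt_token_ids: list[int],
                        semantic_image_tokens: list[int]) -> list[int]:
    """Concatenate [BOS] prompt [SEP] semantic-image-tokens [BOI].

    The decoder generates new image tokens after BOI. Because the
    initial image is represented only by semantic tokens, the model is
    free to re-lay-out the scene while retaining whatever semantics
    were already correct.
    """
    return [BOS, *prompt_token_ids, SEP, *semantic_image_tokens, BOI]


# Example: a 4-token prompt plus 6 semantic tokens from the first image.
condition = build_rvr_condition([101, 102, 103, 104], [7, 8, 9, 10, 11, 12])
print(condition)  # [1, 101, 102, 103, 104, 2, 7, 8, 9, 10, 11, 12, 3]
```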
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this one:
- RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details (2026)
- Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning (2026)
- InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning (2026)
- SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing (2026)
- RefAlign: Representation Alignment for Reference-to-Video Generation (2026)
- FluSplat: Sparse-View 3D Editing without Test-Time Optimization (2026)
- SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing (2026)