Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
Abstract
Refinement via Regeneration (RvR) improves image refinement in unified multimodal models by reformulating it as conditional image regeneration rather than editing, yielding more complete semantic alignment and higher benchmark scores.
Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially raising the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, in which the UMM produces editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment within a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, which raises GenEval from 0.78 to 0.91, DPG-Bench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.
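To make the paradigm concrete, here is a minimal sketch of the RvR loop. The `UnifiedModel` interface and every name in it (`generate`, `encode_semantic_tokens`, `is_aligned`) are hypothetical stand-ins for a UMM's generation and understanding heads, not the authors' released code.

```python
# Minimal sketch of the RvR loop described in the abstract. All class and
# method names are hypothetical illustrations of the paradigm.
from dataclasses import dataclass


@dataclass
class Image:
    payload: bytes  # placeholder for pixel data


class UnifiedModel:
    """Toy stand-in for a unified multimodal model (UMM)."""

    def generate(self, prompt: str, semantic_tokens: list[int] | None = None) -> Image:
        # RvR path: when `semantic_tokens` is given, regeneration is
        # conditioned on the target prompt plus the semantic tokens of
        # the previous image (no pixel-level preservation constraint).
        return Image(payload=b"")

    def encode_semantic_tokens(self, image: Image) -> list[int]:
        # Encode the image into semantic (not pixel-level) tokens.
        return [0]

    def is_aligned(self, prompt: str, image: Image) -> bool:
        # Self-check with the model's understanding head: does the
        # image satisfy every element of the prompt?
        return True


def refine_via_regeneration(model: UnifiedModel, prompt: str, max_rounds: int = 3) -> Image:
    image = model.generate(prompt)  # initial T2I generation
    for _ in range(max_rounds):
        if model.is_aligned(prompt, image):
            break
        # Unlike refinement-via-editing, no natural-language edit
        # instruction is produced; the image is regenerated conditioned
        # on the prompt and its own semantic tokens, which enlarges the
        # effective modification space.
        tokens = model.encode_semantic_tokens(image)
        image = model.generate(prompt, semantic_tokens=tokens)
    return image
```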
Community
🔥 Regeneration over editing: unlocking more effective image refinement!
We present Refinement via Regeneration (RvR), a novel framework that reformulates image refinement in unified multimodal models from an editing-based paradigm to a regeneration-based one. Instead of relying on intermediate editing instructions and enforcing pixel-level consistency, our method directly regenerates images conditioned on the target prompt and semantic representations of the initial image, thereby enlarging the effective modification space. This design enables more complete semantic alignment and avoids error accumulation from coarse instructions, leading to more flexible and accurate refinement.
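As a companion sketch, the snippet below shows one way the regeneration condition could be assembled for an autoregressive UMM: text tokens of the target prompt concatenated with semantic tokens of the initial image. The special tokens and all token IDs are assumptions for illustration; the point is that the initial image enters the condition only through semantic tokens, so no pixel-level consistency constraint is imposed.

```python
# Hypothetical assembly of an RvR conditioning sequence for an
# autoregressive UMM. Special-token IDs are invented for illustration.
BOS, SEP, BOI = 1, 2, 3  # begin-of-sequence, separator, begin-of-image

def build_rvr_condition(prompt_token_ids: list[int],
                        semantic_image_tokens: list[int]) -> list[int]:
    """Concatenate [BOS] prompt [SEP] semantic-image-tokens [BOI].

    The decoder generates new image tokens after BOI. Because the
    initial image is represented only by semantic tokens, the model is
    free to re-lay-out the scene while retaining whatever semantics
    were already correct.
    """
    return [BOS, *prompt_token_ids, SEP, *semantic_image_tokens, BOI]


# Example: a 4-token prompt plus 6 semantic tokens from the first image.
condition = build_rvr_condition([101, 102, 103, 104], [7, 8, 9, 10, 11, 12])
print(condition)  # [1, 101, 102, 103, 104, 2, 7, 8, 9, 10, 11, 12, 3]
```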
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this one:
- RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details (2026)
- Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning (2026)
- InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning (2026)
- SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing (2026)
- RefAlign: Representation Alignment for Reference-to-Video Generation (2026)
- FluSplat: Sparse-View 3D Editing without Test-Time Optimization (2026)
- SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing (2026)