Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation
Abstract
Subject Fidelity Optimization (SFO) enhances zero-shot subject-driven generation by introducing synthetic negative samples and reweighting diffusion timesteps, outperforming baselines in subject fidelity and text alignment.
We present Subject Fidelity Optimization (SFO), a novel comparative learning framework that enhances subject fidelity in zero-shot subject-driven generation. Supervised fine-tuning methods rely only on positive targets and reuse the diffusion loss from the pre-training stage. SFO goes further: it introduces synthetic negative targets and explicitly guides the model to favor positives over negatives through pairwise comparison. To produce negative targets, we propose Condition-Degradation Negative Sampling (CDNS), which automatically generates distinctive and informative negatives by intentionally degrading visual and textual cues, without expensive human annotation. Moreover, we reweight the diffusion timesteps to focus fine-tuning on intermediate steps where subject details emerge. Extensive experiments demonstrate that SFO with CDNS significantly outperforms baselines in both subject fidelity and text alignment on a subject-driven generation benchmark. Project page: https://subjectfidelityoptimization.github.io/
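The abstract does not give the exact objective, so the following is only a minimal sketch of how a pairwise comparison between a positive and a negative target, combined with a timestep weight, might look. Everything here is an assumption made for illustration: the Bradley-Terry-style logistic loss, the bell-shaped weight, and the names `timestep_weight`, `pairwise_comparison_loss`, `beta`, `center`, and `width` are hypothetical and are not taken from the paper. The inputs `err_pos` and `err_neg` stand for the model's denoising errors on the positive and negative targets at timestep `t`.

```python
import math


def timestep_weight(t, num_steps=1000, center=0.5, width=0.2):
    """Hypothetical bell-shaped weight that peaks at intermediate timesteps.

    The abstract only says fine-tuning is focused on intermediate steps;
    the Gaussian shape and its parameters are assumptions.
    """
    x = t / num_steps
    return math.exp(-0.5 * ((x - center) / width) ** 2)


def pairwise_comparison_loss(err_pos, err_neg, t, beta=1.0):
    """Logistic (Bradley-Terry style) loss on a positive/negative pair.

    The loss is small when the denoising error on the positive target
    is lower than the error on the negative target. This is an assumed
    form, not the objective stated by the authors.
    """
    margin = beta * (err_neg - err_pos)
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    softplus = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return timestep_weight(t) * softplus


if __name__ == "__main__":
    # The model fits the positive better than the negative: low loss.
    print(pairwise_comparison_loss(err_pos=0.1, err_neg=0.9, t=500))
    # The model fits the negative better than the positive: higher loss.
    print(pairwise_comparison_loss(err_pos=0.9, err_neg=0.1, t=500))
```

Under these assumptions, a pair sampled near the middle of the schedule contributes more to the loss than one sampled near either end, and swapping which target the model fits better raises the loss.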
Community
We introduce Subject Fidelity Optimization (SFO), which enhances subject fidelity in zero-shot subject-driven text-to-image generation by introducing negative targets and a comparison-based learning signal that explicitly guides the model on which aspects are desirable and which are not.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation (2025)
- AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment (2025)
- In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation (2025)
- Subject-driven Video Generation via Disentangled Identity and Motion (2025)
- MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval (2025)
- RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation (2025)
- Flux Already Knows - Activating Subject-Driven Image Generation without Training (2025)