Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis
Abstract
The success of deep learning in computer vision over the past decade has hinged on large labeled datasets and strong pretrained models. In data-scarce settings, the quality of these pretrained models becomes crucial for effective transfer learning. Image classification and self-supervised learning have traditionally been the primary methods for pretraining CNNs and transformer-based architectures. Recently, the rise of text-to-image generative models, particularly those using denoising diffusion in a latent space, has introduced a new class of foundation models trained on massive, captioned image datasets. These models' ability to generate realistic images of unseen content suggests they possess a deep understanding of the visual world. In this work, we present Marigold, a family of conditional generative models and a fine-tuning protocol that extracts the knowledge from pretrained latent diffusion models like Stable Diffusion and adapts them for dense image analysis tasks, including monocular depth estimation, surface normal prediction, and intrinsic decomposition. Marigold requires minimal modification of the pretrained latent diffusion model's architecture, trains with small synthetic datasets on a single GPU over a few days, and demonstrates state-of-the-art zero-shot generalization. Project page: https://marigoldcomputervision.github.io