Abstract
A new method leverages a pre-trained generative model to construct Alchemist, a high-impact SFT dataset that improves the generative quality of text-to-image models while preserving diversity.
Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and prompt alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness depends heavily on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and creating high-quality, general-purpose SFT datasets remains a significant challenge: current curation methods are often costly and struggle to identify truly impactful samples. The problem is compounded by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. We also publicly release the weights of the fine-tuned models.
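To make the core idea concrete, below is a minimal, hypothetical sketch of using a pre-trained diffusion model to rank candidate (image, caption) pairs and keep only the top-scoring ones for SFT. The denoising-error proxy, the `diffusers`-style UNet and scheduler interfaces, and all names are illustrative assumptions; the paper's actual impact estimator may be defined differently.

```python
# Hypothetical sketch of impact-based SFT data curation.
# Assumption: a sample's "impact" is approximated by how well the
# pre-trained model already denoises it; the paper's true scoring
# rule is not reproduced here.
import torch
import torch.nn.functional as F

@torch.no_grad()
def impact_score(unet, scheduler, latents, text_emb):
    """Score one candidate by the pre-trained model's denoising error."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    return -F.mse_loss(pred, noise).item()  # higher = better denoised

def curate(candidates, unet, scheduler, k=3350):
    """Rank (latents, text_emb) candidates and keep the top-k for SFT."""
    ranked = sorted(candidates,
                    key=lambda c: impact_score(unet, scheduler, *c),
                    reverse=True)
    return ranked[:k]
```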
Community
Hi everyone! We are thrilled to announce our latest work on improving text-to-image models through smarter dataset curation! 🎨✨
While pre-trained T2I models have broad knowledge, achieving high-quality outputs often requires fine-tuning on carefully curated data. But how do we identify the most impactful samples without costly manual effort? Our paper introduces a novel method that leverages a generative model to estimate high-value training data, resulting in Alchemist, a compact (3,350 samples) yet powerful general-purpose SFT dataset. Feel free to use it from our repo: https://huggingface.co/datasets/yandex/alchemist.
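For anyone who wants to try it right away, here is a minimal loading sketch with the `datasets` library; the `train` split name is an assumption, so check the dataset card for the actual schema:

```python
# Minimal usage sketch: load Alchemist from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("yandex/alchemist", split="train")  # split name assumed
print(ds)     # inspect columns and size
print(ds[0])  # first sample, e.g., a prompt/image pair
```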
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation (2025)
- SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization (2025)
- Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources (2025)
- Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects (2025)
- Improving Physical Object State Representation in Text-to-Image Generative Systems (2025)
- Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition (2025)
- InstructEngine: Instruction-driven Text-to-Image Alignment (2025)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`