The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models
Abstract
Transformer-based text-to-image diffusion models show varying degrees of content-style separation in generated artworks, as revealed by cross-attention heatmaps.
Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.
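The abstract describes attributing generated pixels to prompt tokens via cross-attention heatmaps and then comparing regions driven by content tokens versus style tokens. Below is a minimal, illustrative sketch of that idea, not the authors' released code: it assumes cross-attention weights have already been captured from one layer during sampling (random weights stand in here), and the token indices and 16x16 attention resolution are hypothetical choices for the example.

```python
# Minimal sketch: aggregate per-token cross-attention into "content" vs. "style"
# attribution maps. Assumes attention weights were captured during sampling;
# random weights are used here purely as a stand-in.
import torch

def token_heatmaps(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """attn: (heads, h*w, num_tokens) cross-attention weights from one layer.
    Returns (num_tokens, h, w) heatmaps averaged over attention heads."""
    maps = attn.mean(dim=0)                      # (h*w, num_tokens)
    maps = maps.permute(1, 0).reshape(-1, h, w)  # (num_tokens, h, w)
    return maps

def group_attribution(maps: torch.Tensor, token_ids: list[int]) -> torch.Tensor:
    """Average the heatmaps of a token group and normalize to [0, 1]."""
    m = maps[token_ids].mean(dim=0)
    return (m - m.min()) / (m.max() - m.min() + 1e-8)

# Stand-in attention for a prompt like "a cow in the style of Rembrandt".
heads, h, w, num_tokens = 8, 16, 16, 10
attn = torch.rand(heads, h * w, num_tokens).softmax(dim=-1)
maps = token_heatmaps(attn, h, w)

content_tokens = [1, 2]   # hypothetical indices for content tokens ("a cow")
style_tokens = [6, 7, 8]  # hypothetical indices for style tokens ("style of Rembrandt")
content_map = group_attribution(maps, content_tokens)
style_map = group_attribution(maps, style_tokens)
print(content_map.shape, style_map.shape)  # torch.Size([16, 16]) torch.Size([16, 16])
```

In practice the two normalized maps can be overlaid on the generated image to inspect whether content tokens concentrate on object regions while style tokens spread over background and texture, as the paper reports.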
Community
This research investigates how text-to-image diffusion models internally represent artistic concepts like content and style when generating artworks. Using cross-attention analysis, we examine how these models separate content-describing and style-describing elements in prompts. Our findings reveal that diffusion models show varying degrees of content-style separation, with content tokens typically influencing object regions and style tokens affecting backgrounds and textures.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models (2025)
- Style Composition within Distinct LoRA modules for Traditional Art (2025)
- StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization (2025)
- Blending Concepts with Text-to-Image Diffusion Models (2025)
- Only-Style: Stylistic Consistency in Image Generation without Content Leakage (2025)
- Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models (2025)
- LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing (2025)