---
base_model: black-forest-labs/FLUX.1-dev
license: other
license_name: compass-lora-weights-nc-license
license_link: LICENSE
pipeline_tag: text-to-image
library_name: diffusers
tags:
  - text-to-image
  - lora
  - diffusers
  - template:diffusion-lora
widget:
  - text: a photo of a laptop above a dog
    output:
      url: images/laptop-above-dog.jpg
  - text: a photo of a bird below a skateboard
    output:
      url: images/bird-below-skateboard.jpg
  - text: a photo of a horse to the left of a bottle
    output:
      url: images/horse-left-bottle.jpg
---

# CoMPaSS-FLUX.1: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

Project Page | Code | arXiv


## Model Description

A LoRA adapter that enhances the spatial understanding capabilities of the FLUX.1 text-to-image diffusion model. Presented in CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models, it delivers significant improvements in generating images with specific spatial relationships between objects.

## Model Details

- **Base Model:** FLUX.1-dev
- **LoRA Rank:** 16
- **Training Data:** SCOP dataset (curated from COCO)
- **File Size:** ~50 MiB
- **Framework:** Diffusers
- **License:** Non-Commercial (see ./LICENSE)

## Intended Use

- Generating images with accurate spatial relationships between objects
- Creating compositions that require specific spatial arrangements
- Enhancing the base model's spatial understanding while maintaining its other capabilities

## Performance

### Key Improvements

- VISOR benchmark: +98% relative improvement
- T2I-CompBench Spatial: +67% relative improvement
- GenEval Position: +131% relative improvement
- Maintains or improves the base model's image fidelity (lower FID and CMMD scores than the base model)

## Using the Model

See our GitHub repository to get started.
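A minimal inference sketch with the Diffusers library, assuming the LoRA weights ship in this repository and a diffusers version with FLUX support is installed. The repository id below is a placeholder, not the actual Hub id:

```python
import torch
from diffusers import FluxPipeline

# Load the base FLUX.1-dev model, then attach the CoMPaSS LoRA adapter.
# NOTE: "<this-repo-id>" is a placeholder -- substitute the Hub id of this model.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("<this-repo-id>")
pipe.to("cuda")

# Generate an image with an explicit spatial relationship in the prompt.
image = pipe(
    "a photo of a laptop above a dog",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("laptop-above-dog.png")
```

The `num_inference_steps` and `guidance_scale` values above are typical FLUX.1-dev settings, not values prescribed by this model card.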

### Effective Prompting

The model works well with:

- Clear spatial relationship descriptors (left, right, above, below)
- Pairs of distinct objects
- Explicit spatial relationships (e.g., "a photo of A to the right of B")
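The prompt pattern above can be generated programmatically. The helper below is a hypothetical sketch (not part of the model or its training code) that builds prompts in the explicit spatial template:

```python
# Hypothetical helper that builds prompts in the explicit spatial template
# the model responds well to: "a photo of A <relation> B".
RELATIONS = {
    "left": "to the left of",
    "right": "to the right of",
    "above": "above",
    "below": "below",
}

def spatial_prompt(obj_a: str, obj_b: str, relation: str) -> str:
    """Return an explicit spatial prompt for a supported relation."""
    if relation not in RELATIONS:
        raise ValueError(f"unsupported relation: {relation!r}")
    return f"a photo of {obj_a} {RELATIONS[relation]} {obj_b}"

print(spatial_prompt("a horse", "a bottle", "left"))
# → a photo of a horse to the left of a bottle
```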

## Training Details

### Training Data

- Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
- ~28,000 curated object pairs from COCO
- Enforces criteria for:
  - Visual significance
  - Semantic distinction
  - Spatial clarity
  - Object relationships
  - Visual balance

### Training Process

- Trained for 24,000 steps
- Batch size: 4
- Learning rate: 1e-4
- Optimizer: AdamW with β₁=0.9, β₂=0.999
- Weight decay: 1e-2
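For reference, the hyperparameters above can be collected into a single configuration mapping. This is an illustrative sketch; the key names are assumptions, not taken from the official training code:

```python
# Hypothetical config mirroring the reported hyperparameters.
# Key names are illustrative; they are not from the official training code.
TRAIN_CONFIG = {
    "max_steps": 24_000,
    "batch_size": 4,
    "learning_rate": 1e-4,
    "optimizer": "AdamW",
    "betas": (0.9, 0.999),
    "weight_decay": 1e-2,
    "lora_rank": 16,
}

def total_samples(cfg: dict) -> int:
    """Total training samples seen = steps × batch size."""
    return cfg["max_steps"] * cfg["batch_size"]

print(total_samples(TRAIN_CONFIG))  # → 96000
```

With ~28,000 SCOP object pairs, 96,000 samples corresponds to roughly 3.4 passes over the dataset.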

## Evaluation Results

| Metric | FLUX.1 | +CoMPaSS |
| --- | --- | --- |
| VISOR uncond (⬆️) | 37.96% | 75.17% |
| T2I-CompBench Spatial (⬆️) | 0.18 | 0.30 |
| GenEval Position (⬆️) | 0.26 | 0.60 |
| FID (⬇️) | 27.96 | 26.40 |
| CMMD (⬇️) | 0.8737 | 0.6859 |

## Citation

If you use this model in your research, please cite:

```bibtex
@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}
```

## Contact

For questions about the model, please contact [email protected]

## Download Model

Weights for this model are available in Safetensors format.

Download them from the Files & versions tab.