# ICLR25: Incorporating Visual Correspondence into Diffusion Model for Visual Try-On

This is the official repository for the [Paper](*) "Incorporating Visual Correspondence into Diffusion Model for Visual Try-On".

## Overview

We propose to explicitly capitalize on visual correspondence as a prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as the appearance reference.

## Installation

Create a conda environment and install the requirements:

```
conda create -n SPM-Diff python==3.9.0
conda activate SPM-Diff
cd SPM-Diff-main
pip install -r requirements.txt
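To quickly confirm the environment is usable, you can check that PyTorch imports and can see the GPU. This is an optional sanity check; it assumes `requirements.txt` installs `torch`, which the diffusion pipeline presumably requires:

```
# Minimal environment check (assumes requirements.txt installs PyTorch).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```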
## Semantic Point Matching

In SPM, a set of semantic points is first sampled on the garment and matched to the corresponding points on the target person via local flow warping. These 2D cues are then augmented into 3D-aware cues using depth/normal maps, and serve as semantic point matching supervision for the diffusion model.

You can directly download the [Semantic Point Feature](*) or follow the instructions in [preprocessing.md](*) to extract the Semantic Point Feature yourself.
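For intuition, the sketch below shows one way such point cues could be assembled. It is a hypothetical illustration only: the `build_point_cues` helper, tensor names, and shapes are assumptions, not the repo's actual preprocessing (see [preprocessing.md](*) for that).

```
import torch

def build_point_cues(garment_mask, flow, depth, normal, num_points=128):
    """Hypothetical sketch: sample semantic points on the garment, warp them
    onto the target person with a local flow field, and attach depth/normal
    values at the warped locations to form 3D-aware point cues.

    garment_mask: (H, W) bool mask of the garment region
    flow:         (2, H, W) flow field, [0]=x and [1]=y displacement (assumed)
    depth:        (H, W) depth map of the target person
    normal:       (3, H, W) normal map of the target person
    """
    H, W = garment_mask.shape

    # 1) Sample semantic points inside the garment region.
    ys, xs = torch.nonzero(garment_mask, as_tuple=True)
    keep = torch.randperm(ys.numel())[:num_points]
    ys, xs = ys[keep], xs[keep]

    # 2) Match garment points to person points via local flow warping.
    wx = (xs + flow[0, ys, xs]).round().long().clamp(0, W - 1)
    wy = (ys + flow[1, ys, xs]).round().long().clamp(0, H - 1)

    # 3) Augment the 2D matches into 3D-aware cues with depth/normal values.
    cues = torch.cat(
        [
            torch.stack([wx, wy], dim=1).float(),   # (N, 2) 2D positions
            depth[wy, wx].unsqueeze(1),             # (N, 1) depth
            normal[:, wy, wx].t(),                  # (N, 3) surface normals
        ],
        dim=1,
    )
    return cues  # (N, 6) point cues for supervising the diffusion model
```

In practice you do not need to write this yourself; the downloadable Semantic Point Feature files provide the precomputed cues.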
## Dataset

You can download the VITON-HD dataset from [here](https://github.com/xiezhy6/GP-VTON).

For inference, the following dataset structure is required:

```
test
|-- image
|-- masked_vton_img
|-- warp-cloth
|-- cloth
|-- cloth_mask
|-- point
```

## Inference

Please download the pre-trained model from [Google Link](*), then run:

```
sh inference.sh
```

## Acknowledgement

Thanks to [LaDI-VTON](https://github.com/miccunifi/ladi-vton) and [GP-VTON](https://github.com/xiezhy6/GP-VTON) for their contributions.