---
base_model:
- SherryXTChen/LatentDiffusionDINOv2
datasets:
- timbrooks/instructpix2pix-clip-filtered
- SherryXTChen/InstructCLIP-InstructPix2Pix-Data
language:
- en
license: apache-2.0
pipeline_tag: image-to-image
library_name: diffusers
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

# InstructCLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning (CVPR 2025)

This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration. It is based on the paper [Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning](https://huggingface.co/papers/2503.18406).

[arXiv](http://arxiv.org/abs/2503.18406) | [Image Editing Model](https://huggingface.co/SherryXTChen/InstructCLIP-InstructPix2Pix) | [Data Refinement Model](https://huggingface.co/SherryXTChen/Instruct-CLIP) | [Data](https://huggingface.co/datasets/SherryXTChen/InstructCLIP-InstructPix2Pix-Data)

## Capabilities

<p align="center">
  <img src="https://raw.githubusercontent.com/SherryXTChen/Instruct-CLIP/refs/heads/main/assets/teaser_2.png" alt="Figure 2" width="50%">
</p>

## Installation

```
pip install -r requirements.txt
```

## Edit Instruction Refinement Inference

```python
import argparse

from PIL import Image
import torch
from torchvision import transforms

from model import InstructCLIP
from utils import get_sd_components, normalize

parser = argparse.ArgumentParser(description="Simple example of estimating an edit instruction from an image pair")
parser.add_argument(
    "--pretrained_instructclip_name_or_path",
    type=str,
    default="SherryXTChen/Instruct-CLIP",
    help="Instruct-CLIP pretrained checkpoint",
)
parser.add_argument(
    "--pretrained_model_name_or_path",
    type=str,
    default="runwayml/stable-diffusion-v1-5",
    help="Stable Diffusion pretrained checkpoint",
)
parser.add_argument(
    "--input_path",
    type=str,
    default="assets/1_input.jpg",
    help="Input image path",
)
parser.add_argument(
    "--output_path",
    type=str,
    default="assets/1_output.jpg",
    help="Output (edited) image path",
)
args = parser.parse_args()
device = "cuda"

# Load the model for edit instruction estimation
model = InstructCLIP.from_pretrained(args.pretrained_instructclip_name_or_path)
model = model.to(device).eval()

# Load the components used to preprocess/encode images into the latent space
tokenizer, _, vae, _, _ = get_sd_components(args, device, torch.float32)

# Prepare the input/output image pair
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])
image_list = [args.input_path, args.output_path]
image_list = [
    transform(Image.open(f).resize((512, 512))).unsqueeze(0).to(device)
    for f in image_list
]
with torch.no_grad():
    image_list = [
        vae.encode(x).latent_dist.sample() * vae.config.scaling_factor
        for x in image_list
    ]

# Get the image-pair feature
zero_timesteps = torch.zeros_like(torch.tensor([0])).to(device)
img_feat = model.get_image_features(
    inp=image_list[0], out=image_list[1],
    inp_t=zero_timesteps, out_t=zero_timesteps)
img_feat = normalize(img_feat)

# Decode the feature into an edit instruction
pred_instruct_input_ids = model.text_decoder.infer(img_feat[:1])[0]
pred_instruct = tokenizer.decode(pred_instruct_input_ids, skip_special_tokens=True)
print(pred_instruct)  # as a 3 d sculpture
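```

## Image Editing Inference

Instructions refined this way can drive the companion editing model linked above. Below is a minimal sketch, assuming the `SherryXTChen/InstructCLIP-InstructPix2Pix` checkpoint loads with the standard `StableDiffusionInstructPix2PixPipeline` from `diffusers`; the instruction string and guidance values are illustrative, so see that model card for authoritative usage.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Assumption: the editing checkpoint ships in the standard
# InstructPix2Pix pipeline layout.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "SherryXTChen/InstructCLIP-InstructPix2Pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("assets/1_input.jpg").resize((512, 512))
edited = pipe(
    "as a 3 d sculpture",      # e.g., an instruction refined by Instruct-CLIP
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # fidelity to the input image
    guidance_scale=7.5,        # adherence to the instruction
).images[0]
edited.save("assets/edited.jpg")
```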
## Citation

```bibtex
@misc{chen2025instructclipimprovinginstructionguidedimage,
  title={Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning},
  author={Sherry X. Chen and Misha Sra and Pradeep Sen},
  year={2025},
  eprint={2503.18406},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.18406},
}
```