---
license: apache-2.0
base_model:
- OpenGVLab/InternVL2_5-26B
---

# IDMR-26B

**IDMR** is a universal multimodal embedding model, particularly well-suited for **Instance-Driven Multimodal Retrieval (IDMR)** tasks. It is designed to achieve fine-grained, instance-level visual correspondence across modalities.

---

### 🔍 Learn More About IDMR

- 📄 Paper: [IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval](https://arxiv.org/pdf/2504.00954)
- 🤗 Demo: [IDMR Demo on Hugging Face Spaces](https://huggingface.co/spaces/lbw18601752667/IDMR-demo)
- 💻 Code: [GitHub](https://github.com/BwLiu01/IDMR)

## 🚀 Usage

To get started, clone the GitHub repository and install the required dependencies:

```bash
git clone https://github.com/BwLiu01/IDMR.git
cd IDMR
pip install -r requirements.txt
```

```python
import torch
import numpy as np
from PIL import Image
from src.model import IDMRModel
from src.vlm_backbone.intern_vl import InternVLProcessor
from src.arguments import ModelArguments
from transformers import AutoTokenizer, AutoImageProcessor

device = "cuda"
IMAGE_TOKEN = "<image>"

# Load model and processor
model_args = ModelArguments(
    model_name="lbw18601752667/IDMR-26B",
    model_backbone="internvl_2_5"
)

# Initialize processor
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_args.model_name, trust_remote_code=True, use_fast=False)
processor = InternVLProcessor(image_processor=image_processor, tokenizer=tokenizer)

# Load model
model = IDMRModel.load(model_args).to(device, dtype=torch.bfloat16).eval()

def get_embedding(text, image=None, type="qry"):
    """Get embedding for text and/or image input."""
    inputs = processor(
        text=f"{IMAGE_TOKEN}\n {text}" if text else f"{IMAGE_TOKEN}\n Represent the given image.",
        images=[image] if image else None,
        return_tensors="pt",
        max_length=1024,
        truncation=True
    )
    inputs = {key: value.to(device) for key, value in inputs.items()}
    # image_flags tells the backbone whether an image is actually present in this input
    inputs["image_flags"] = torch.tensor([1 if image else 0], dtype=torch.long).to(device)

    with torch.no_grad(), torch.autocast(device_type=device, dtype=torch.bfloat16):
        if type == "qry":
            output = model(qry=inputs)["qry_reps"]
        else:
            output = model(tgt=inputs)["tgt_reps"]
    return output.float()

# Query: text + image
query_text = "your query text"
query_image = Image.open("your query image path")
query_embedding = get_embedding(query_text, query_image, type="qry")

# Target: image only
target_image = Image.open("your target image path")
target_embedding = get_embedding(None, target_image, type="tgt")

print(model.compute_similarity(query_embedding, target_embedding))
```
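
Building on the helper above, the following is a minimal retrieval sketch that ranks a small gallery of candidate images against a single query. The candidate paths and query text are placeholders, and it assumes `model.compute_similarity` returns a scalar-like score for a query/target pair, as in the snippet above.

```python
# Minimal retrieval sketch: embed the query once, embed each candidate image,
# then rank candidates by similarity. Replace the placeholder paths with your own.
candidate_paths = ["candidate_1.jpg", "candidate_2.jpg", "candidate_3.jpg"]

query_emb = get_embedding("your query text", Image.open("your query image path"), type="qry")

scores = []
for path in candidate_paths:
    tgt_emb = get_embedding(None, Image.open(path), type="tgt")
    # Same similarity call as in the example above
    score = model.compute_similarity(query_emb, tgt_emb)
    scores.append((path, float(score)))

# Print candidates from most to least similar
for path, score in sorted(scores, key=lambda x: x[1], reverse=True):
    print(f"{path}: {score:.4f}")
```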