Knowledge to Sight (K2Sight)
Knowledge to Sight (K2Sight) is a novel framework designed for grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. Unlike generalist Vision-Language Models (VLMs) that often struggle with domain-specific medical terms, K2Sight introduces structured semantic supervision. It achieves this by decomposing clinical concepts into interpretable visual attributes like shape, density, and anatomical location, distilled from domain ontologies.
These attributes guide region-text alignment during training, allowing compact models (0.23B and 2B parameters) to be trained with only 1.5% of the data used by state-of-the-art medical VLMs. Despite their small size and limited training data, K2Sight models match or outperform 7B+ medical VLMs, with up to a 9.82% improvement in $mAP_{50}$.
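As a rough illustration of this decomposition (the actual ontology-derived definitions and prompt construction live in the official code), a clinical concept might be expanded into attribute text along the following lines; the wording below is a hypothetical example, not the paper's definitions.
# Illustrative sketch only: the attribute wording is invented for this example,
# not the ontology-derived definitions distilled by K2Sight.
knowledge = {
    "pneumothorax": {
        "shape": "crescent-shaped lucent region without lung markings",
        "density": "radiolucent, darker than the adjacent aerated lung",
        "location": "peripheral pleural space, often apical",
    }
}

finding = "pneumothorax"
attribute_text = "; ".join(f"{k}: {v}" for k, v in knowledge[finding].items())
print(f"{finding} -> {attribute_text}")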
- Paper: Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding
- Project Page: https://lijunrio.github.io/K2Sight/
- Code: https://github.com/LijunRio/AG-KD
- Demo: https://huggingface.co/spaces/RioJune/AG-KD
Usage
The model can be used out of the box for zero-shot abnormality grounding in medical images.
First, install the necessary dependencies:
pip install torch transformers Pillow
# For full project dependencies and further setup, refer to the official GitHub repository.
Here's a basic example of how to use the model for abnormality grounding:
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
# Load model and processor
model_id = "RioJune/AG-KD"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Example image (replace with your medical image path)
# Replace the path below with the path to your own medical image.
image = Image.open("path/to/your/medical_image.png").convert("RGB")
# Example instruction for abnormality grounding
# The model expects instructions to start with specific tokens like <OD> for object detection.
instruction = "<OD> Please localize the lesion. "
# Prepare inputs
inputs = processor(images=image, text=instruction, return_tensors="pt")
# Generate output
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
# Decode and print the result
output_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(f"Instruction: {instruction}")
print(f"Detected abnormality: {output_text}")
# The output_text will contain bounding box coordinates (e.g., <loc_000><loc_001><loc_002><loc_003>)
# and a description of the localized finding.
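If you need pixel coordinates, the location tokens can be converted with a small post-processing step. The helper below is a sketch and not part of the released code: it assumes the Florence-2-style convention of 1000 coordinate bins normalized to the image size, so verify the exact convention against the official repository.
import re

def parse_loc_boxes(text, image_width, image_height, num_bins=1000):
    # Hypothetical helper: extract <loc_XXX> tokens and group them into
    # (x1, y1, x2, y2) pixel boxes, assuming num_bins coordinate bins
    # normalized to the image size (Florence-2-style).
    bins = [int(b) for b in re.findall(r"<loc_(\d+)>", text)]
    boxes = []
    for i in range(0, len(bins) - 3, 4):
        x1, y1, x2, y2 = bins[i:i + 4]
        boxes.append((
            x1 / num_bins * image_width,
            y1 / num_bins * image_height,
            x2 / num_bins * image_width,
            y2 / num_bins * image_height,
        ))
    return boxes

width, height = image.size
print(f"Bounding boxes in pixels: {parse_loc_boxes(output_text, width, height)}")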
For more advanced usage, including training and evaluation scripts, please refer to the official GitHub repository.
Citation
If you find our work helpful or inspiring, please cite our paper:
@article{li2025enhancing,
  title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
  author={Li, J. and Liu, C. and Bai, W. and Arcucci, R. and Bercea, C. I. and Schnabel, J. A.},
  journal={arXiv preprint arXiv:2503.03278},
  year={2025}
}