---
license: apache-2.0
datasets:
  - ILSVRC/imagenet-1k
  - bentrevett/caltech-ucsd-birds-200-2011
  - vaishaal/ImageNetV2
  - clip-benchmark/wds_imagenet_sketch
  - clip-benchmark/wds_imagenet-r
  - enterprise-explorers/oxford-pets
  - ethz/food101
  - clip-benchmark/wds_imagenet-a
language:
  - en
metrics:
  - accuracy
base_model:
  - openai/clip-vit-large-patch14
  - openai/clip-vit-base-patch32
pipeline_tag: zero-shot-image-classification
tags:
  - code
---

# LaZSL

This repository contains the code for the ICCV'25 paper "Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model".

The pre-print version is available at [arXiv].

## Requirements

First, install the dependencies with conda:

    conda install pytorch torchvision -c pytorch
    conda install matplotlib torchmetrics -c conda-forge
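
To verify the environment, a quick check like the following can be run (an illustrative snippet, not part of the repository's instructions):

```python
# Minimal environment check (illustrative; not part of the official setup).
import torch
import torchvision
import torchmetrics
import matplotlib

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("torchmetrics:", torchmetrics.__version__)
print("matplotlib:", matplotlib.__version__)
```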

## Preparing the Datasets

Please follow the instructions in `DATASETS.md` to construct the datasets.

## Running

To reproduce the accuracy results from the paper, edit the dataset directories in `load_OP.py` to match your local machine and set `hparams['dataset']` accordingly, then run `python main_OP.py`. All hyperparameters can be modified in `load_OP.py`.
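
As a rough illustration, the edits in `load_OP.py` might look like the sketch below. Only `hparams['dataset']` is mentioned above; the other keys (`data_dir`, `model_size`) and the example values are hypothetical placeholders, so check `load_OP.py` for the actual names.

```python
# Illustrative sketch of the edits in load_OP.py; the actual keys may differ.
hparams = {}

# Hypothetical key: point the loader at your local dataset root.
hparams['data_dir'] = '/path/to/datasets'

# Key referenced in the instructions above: which dataset to evaluate.
hparams['dataset'] = 'imagenet'

# Hypothetical key: CLIP backbone, matching the variants reported in Results.
hparams['model_size'] = 'ViT-B/32'
```

After editing, run `python main_OP.py` to start the evaluation.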

## Results

Results (accuracy, %) of our released models using various evaluation protocols on the following datasets:

| Dataset    | Acc (ViT-B/32) | Acc (ViT-B/16) | Acc (ViT-L/14) |
|------------|----------------|----------------|----------------|
| ImageNet   | 65.3           | 69.2           | 75.7           |
| CUB        | 56.5           | 60.3           | 66.1           |
| OxfordPets | 84.7           | 87.4           | 92.7           |
| Food101    | 85.9           | 89.7           | 93.5           |
| Places365  | 41.5           | 42.0           | 41.8           |
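
For context, the sketch below shows how plain zero-shot CLIP classification is typically scored; it is a generic baseline, not LaZSL's locally-aligned method, and it assumes the Hugging Face `transformers` library, which is not among the listed dependencies.

```python
# Generic zero-shot CLIP classification sketch (baseline CLIP, not LaZSL).
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["tabby cat", "golden retriever"]        # replace with the dataset's class names
prompts = [f"a photo of a {c}" for c in class_names]

@torch.no_grad()
def predict(images):
    """Return the predicted class index for each PIL image."""
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True).to(device)
    logits = model(**inputs).logits_per_image           # shape: (num_images, num_classes)
    return logits.argmax(dim=-1)
```

Top-1 accuracy is then the fraction of images whose predicted index matches the ground-truth label.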

## Citation

If you find LaZSL useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:

    @inproceedings{chen2025interpretable,
      title={Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model},
      author={Chen, Shiming and Duan, Bowen and Khan, Salman and Khan, Fahad Shahbaz},
      booktitle={ICCV},
      year={2025}
    }