---
license: apache-2.0
datasets:
- ILSVRC/imagenet-1k
- bentrevett/caltech-ucsd-birds-200-2011
- vaishaal/ImageNetV2
- clip-benchmark/wds_imagenet_sketch
- clip-benchmark/wds_imagenet-r
- enterprise-explorers/oxford-pets
- ethz/food101
- clip-benchmark/wds_imagenet-a
language:
- en
metrics:
- accuracy
base_model:
- openai/clip-vit-large-patch14
- openai/clip-vit-base-patch32
pipeline_tag: zero-shot-image-classification
tags:
- code
---

# LaZSL

This repository contains the code for the ICCV'25 paper "***Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model***". A pre-print version is available on [arXiv](https://arxiv.org/pdf/2506.23822).

## Requirements

First install the dependencies, e.g. manually with conda:

```
conda install pytorch torchvision -c pytorch
conda install matplotlib torchmetrics -c conda-forge
```

## Preparing Datasets

Please follow the instructions in [DATASETS.md](https://github.com/KaiyangZhou/CoOp/blob/main/DATASETS.md) to construct the datasets.

## Running

To reproduce the accuracy results from the paper, edit the dataset directories in `load_OP.py` to match your local machine and set `hparams['dataset']` accordingly, then run `python main_OP.py`. All hyperparameters can be modified in `load_OP.py`; a hypothetical configuration sketch is given at the end of this README.

## Results

Accuracy (%) of our released models with three CLIP backbones (ViT-B/32, ViT-B/16, ViT-L/14) on five datasets.

| Dataset | Acc (ViT-B/32) | Acc (ViT-B/16) | Acc (ViT-L/14) |
| :-----: | :-----: | :-----: | :-----: |
| ImageNet | 65.3 | 69.2 | 75.7 |
| CUB | 56.5 | 60.3 | 66.1 |
| OxfordPets | 84.7 | 87.4 | 92.7 |
| Food101 | 85.9 | 89.7 | 93.5 |
| Places365 | 41.5 | 42.0 | 41.8 |

## Citation

If you find LaZSL useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

```bibtex
@inproceedings{chen2025interpretable,
  title={Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model},
  author={Chen, Shiming and Duan, Bowen and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle={ICCV},
  year={2025}
}
```
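
## Example Configuration (Sketch)

The snippet below is only a minimal sketch of the `load_OP.py` edits described in the Running section. Only `hparams['dataset']` is confirmed by this README; the other keys and the accepted string values are assumptions, so consult `load_OP.py` in the repository for the actual option names.

```python
# Hypothetical sketch of the edits expected in load_OP.py.
# Only hparams['dataset'] is confirmed by the README; the remaining keys and
# the accepted values are assumptions, so check load_OP.py for the real names.
hparams = {}
hparams['dataset'] = 'imagenet'            # dataset to evaluate, e.g. ImageNet, CUB, Food101
hparams['data_dir'] = '/path/to/datasets'  # hypothetical key: dataset root prepared via DATASETS.md
hparams['model_size'] = 'ViT-B/32'         # hypothetical key: CLIP backbone to evaluate
```

After editing, run `python main_OP.py` as described in the Running section.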