---
license: apache-2.0
datasets:
- ILSVRC/imagenet-1k
- bentrevett/caltech-ucsd-birds-200-2011
- vaishaal/ImageNetV2
- clip-benchmark/wds_imagenet_sketch
- clip-benchmark/wds_imagenet-r
- enterprise-explorers/oxford-pets
- ethz/food101
- clip-benchmark/wds_imagenet-a
language:
- en
metrics:
- accuracy
base_model:
- openai/clip-vit-large-patch14
- openai/clip-vit-base-patch32
pipeline_tag: zero-shot-image-classification
tags:
- code
---

# LaZSL

This repository contains the code for the ICCV'25 paper "***Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model***". A pre-print version is available on [arXiv](https://arxiv.org/pdf/2506.23822).

## Requirements

First install the dependencies, e.g. manually with conda:

```
conda install pytorch torchvision -c pytorch
conda install matplotlib torchmetrics -c conda-forge
```

## Preparing Datasets

Please follow the instructions in [DATASETS.md](https://github.com/KaiyangZhou/CoOp/blob/main/DATASETS.md) to construct the datasets.

## Running

To reproduce the accuracy results from the paper, edit the dataset directories in `load_OP.py` to match your local machine and set `hparams['dataset']` accordingly, then run `python main_OP.py`. All hyperparameters can be modified in `load_OP.py`; a hypothetical configuration sketch is given at the end of this README.

## Results

Accuracy (%) of our released models with three CLIP backbones (ViT-B/32, ViT-B/16, ViT-L/14) on five datasets.

| Dataset | Acc (ViT-B/32) | Acc (ViT-B/16) | Acc (ViT-L/14) |
| :-----: | :-----: | :-----: | :-----: |
| ImageNet | 65.3 | 69.2 | 75.7 |
| CUB | 56.5 | 60.3 | 66.1 |
| OxfordPets | 84.7 | 87.4 | 92.7 |
| Food101 | 85.9 | 89.7 | 93.5 |
| Places365 | 41.5 | 42.0 | 41.8 |

## Citation

If you find LaZSL useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

```bibtex
@inproceedings{chen2025interpretable,
  title={Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model},
  author={Chen, Shiming and Duan, Bowen and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle={ICCV},
  year={2025}
}
```
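
## Example Configuration (Sketch)

The snippet below is only a minimal sketch of the `load_OP.py` edits described in the Running section. Only `hparams['dataset']` is confirmed by this README; the other keys and the accepted string values are assumptions, so consult `load_OP.py` in the repository for the actual option names.

```python
# Hypothetical sketch of the edits expected in load_OP.py.
# Only hparams['dataset'] is confirmed by the README; the remaining keys and
# the accepted values are assumptions, so check load_OP.py for the real names.
hparams = {}
hparams['dataset'] = 'imagenet'            # dataset to evaluate, e.g. ImageNet, CUB, Food101
hparams['data_dir'] = '/path/to/datasets'  # hypothetical key: dataset root prepared via DATASETS.md
hparams['model_size'] = 'ViT-B/32'         # hypothetical key: CLIP backbone to evaluate
```

After editing, run `python main_OP.py` as described in the Running section.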