arxiv:2506.21476

Global and Local Entailment Learning for Natural World Imagery

Published on Jun 26 · Submitted by Srikumar26 on Jun 30
Authors:
Abstract

AI-generated summary: Radial Cross-Modal Embeddings enable explicit modeling of transitive entailment in vision-language models, leading to improved performance in hierarchical species classification and retrieval tasks.

Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.
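The abstract describes optimizing a partial order over concepts so that entailment (e.g., species → genus → family in the Tree of Life) is transitive in the embedding space. The page does not show RCME's actual objective; as a purely illustrative sketch, one generic way to encode such a partial order "radially" is to ask a more specific concept to lie farther from the origin than its ancestor while staying angularly close to it. The function and toy vectors below are hypothetical and not the paper's method:

```python
import numpy as np

def radial_order_loss(parent, child, margin=0.1):
    """Toy partial-order loss (NOT the RCME objective).

    Two hinge-style terms encourage:
      (a) ||child|| > ||parent|| + margin, so distance from the origin
          encodes specificity, and norm comparisons are transitive; and
      (b) a small angle between parent and child directions, so the
          child inherits its ancestor's semantics.
    """
    norm_p = np.linalg.norm(parent)
    norm_c = np.linalg.norm(child)
    order = max(0.0, margin + norm_p - norm_c)       # violated if child is not "deeper"
    cos = np.dot(parent, child) / (norm_p * norm_c)
    angle = np.arccos(np.clip(cos, -1.0, 1.0))       # angular deviation from the ancestor
    return order + angle

# Toy taxonomy embeddings (hypothetical): kingdom -> family -> species.
kingdom = np.array([0.5, 0.1])
family = np.array([1.0, 0.25])
species = np.array([1.6, 0.4])

# The correct parent/child ordering incurs a lower loss than the reversed one.
print(radial_order_loss(kingdom, family) < radial_order_loss(family, kingdom))
```

Because the order term compares norms, it composes transitively: if the species outranks the family and the family outranks the kingdom, the species outranks the kingdom with no extra constraint, which is the flavor of transitivity-enforced entailment the abstract alludes to.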

Community

Paper author and submitter:

Accepted to ICCV 2025!


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 2