Papers
arxiv:2501.16769

Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models

Published on Jan 28
Authors:

Abstract

Self-supervised learning can resolve numerous image or linguistic processing problems when effectively trained. This study investigated simple yet efficient methods for adapting previously learned foundation models for open-vocabulary semantic segmentation tasks. Our research proposed "Beyond-Labels," a lightweight transformer-based fusion module that uses a handful of image segmentation data to fuse frozen image representations with language concepts. This strategy allows the model to successfully actualize enormous knowledge from pretrained models without requiring extensive retraining, making the model data-efficient and scalable. Furthermore, we efficiently captured positional information in images using Fourier embeddings, thus improving the generalization across various image sizes, addressing one of the key limitations of previous methods. Extensive ablation tests were performed to investigate the important components of our proposed method; when tested against the common benchmark PASCAL-5i, it demonstrated superior performance despite being trained on frozen image and language characteristics.

Community

Your need to confirm your account before you can post a new comment.

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2501.16769 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2501.16769 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2501.16769 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.