arxiv:2503.23135

LSNet: See Large, Focus Small

Published on Mar 29

Submitted by jameslahm on Apr 3


Authors: Ao Wang, Hui Chen

Abstract

Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet their computational complexity poses challenges for practical deployment, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly rely on self-attention mechanisms and convolutions for token mixing. This dependence limits the effectiveness and efficiency of the perception and aggregation processes in lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Code and models are available at https://github.com/jameslahm/lsnet.
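The "large-kernel perception, small-kernel aggregation" idea can be illustrated with a toy 1-D sketch in plain Python. This is not the authors' implementation: the function names, the zero padding, and the scoring rule (a softmax over how close each neighbour is to the wide-window mean) are assumptions made purely for illustration, and the sketch has no learned parameters. It only shows the division of labour: a wide window decides the weights, and a narrow window performs the weighted sum.

```python
import math


def _at(x, j):
    """Zero-padded read of a 1-D signal (illustrative border handling)."""
    return x[j] if 0 <= j < len(x) else 0.0


def large_kernel_perception(x, large_k=7, small_k=3):
    """'See large': derive small_k aggregation weights per position
    from a large_k-wide context window."""
    half_l, half_s = large_k // 2, small_k // 2
    weights = []
    for i in range(len(x)):
        # Summarise the wide context around position i.
        context = sum(_at(x, j) for j in range(i - half_l, i + half_l + 1)) / large_k
        # Hypothetical scoring rule: neighbours closer to the context score higher.
        scores = [-abs(_at(x, j) - context) for j in range(i - half_s, i + half_s + 1)]
        # Softmax so the weights at each position sum to 1.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def small_kernel_aggregation(x, weights):
    """'Focus small': weighted sum over a narrow neighbourhood,
    using the position-specific weights computed above."""
    half_s = len(weights[0]) // 2
    out = []
    for i, w in enumerate(weights):
        neighbours = [_at(x, j) for j in range(i - half_s, i + half_s + 1)]
        out.append(sum(wk * v for wk, v in zip(w, neighbours)))
    return out


def ls_conv_1d(x, large_k=7, small_k=3):
    """Toy composition of the two stages on a 1-D signal."""
    return small_kernel_aggregation(x, large_kernel_perception(x, large_k, small_k))


if __name__ == "__main__":
    signal = [0.0, 0.1, 0.0, 5.0, 0.2, 0.1, 0.0, 0.1, 0.0]
    print(ls_conv_1d(signal))
```

Because the aggregation weights depend on the input, the narrow kernel behaves differently at each position, while the cost of the weighted sum itself stays that of a small kernel. The actual LS convolution is defined in the paper and the linked repository.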


Community


jameslahm (paper author, paper submitter), about 22 hours ago (edited about 22 hours ago)

[CVPR 2025] LSNet: See Large, Focus Small 🤗
We introduce LSNet, a new family of lightweight vision models inspired by the dynamic heteroscale capability of the human visual system, i.e., "See Large, Focus Small". LSNet achieves state-of-the-art performance and efficiency trade-offs across various vision tasks.
Code: https://github.com/THU-MIG/lsnet
Models: https://huggingface.co/collections/jameslahm/lsnet-67ebec0ab4e220e7918d9565


Models citing this paper: 7
