Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
Abstract
MLLMSeg integrates features from the MLLM vision encoder and LLM with a lightweight mask decoder to achieve high accuracy in referring expression segmentation at reduced computational cost.
Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large language models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM), which has 632M network parameters, or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we propose MLLMSeg, a novel framework that fully exploits the visual detail features inherently encoded in the MLLM vision encoder without introducing an extra visual encoder. In addition, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual features with the semantic-related features output by the large language model (LLM) of the MLLM. Finally, we establish a lightweight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.
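To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the two components the abstract names: a DSFF-style module that fuses detail features from the MLLM vision encoder with a semantic feature from the LLM, and a small convolutional mask decoder. The layer choices, feature dimensions, and upsampling schedule are assumptions for illustration only, not the authors' implementation (see the linked repository for the real code).

```python
# Hypothetical sketch of the DSFF fusion + light-weight mask decoder described
# in the abstract. Dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class DSFF(nn.Module):
    """Detail-enhanced, semantic-consistent feature fusion (assumed form)."""

    def __init__(self, vis_dim: int, llm_dim: int, fused_dim: int):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, fused_dim, kernel_size=1)
        self.sem_proj = nn.Linear(llm_dim, fused_dim)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * fused_dim, fused_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(fused_dim, fused_dim, kernel_size=3, padding=1),
        )

    def forward(self, vis_feat: torch.Tensor, sem_token: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C_v, H, W) detail features from the MLLM vision encoder
        # sem_token: (B, C_l) semantic feature produced by the LLM
        v = self.vis_proj(vis_feat)
        s = self.sem_proj(sem_token)[:, :, None, None].expand_as(v)
        return self.fuse(torch.cat([v, s], dim=1))


class LightMaskDecoder(nn.Module):
    """Small convolutional decoder that upsamples fused features to mask logits."""

    def __init__(self, fused_dim: int = 256):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(fused_dim, fused_dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(fused_dim // 2, fused_dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(fused_dim // 4, 1, kernel_size=1),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.up(fused)  # (B, 1, 4H, 4W) mask logits


if __name__ == "__main__":
    B, Cv, Cl, H, W = 2, 1024, 4096, 24, 24  # toy sizes, not from the paper
    dsff = DSFF(vis_dim=Cv, llm_dim=Cl, fused_dim=256)
    decoder = LightMaskDecoder(fused_dim=256)
    mask_logits = decoder(dsff(torch.randn(B, Cv, H, W), torch.randn(B, Cl)))
    print(mask_logits.shape)  # torch.Size([2, 1, 96, 96])
```

The sketch follows the paper's stated design goal: the only new trainable parts are the fusion module and a compact decoder, so the segmentation head stays far smaller than a full SAM backbone.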
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance (2025)
- FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation (2025)
- Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025)
- TASeg: Text-aware RGB-T Semantic Segmentation based on Fine-tuning Vision Foundation Models (2025)
- HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation (2025)
- OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts (2025)
- MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces (2025)