arxiv:2505.24025

DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

Published on May 29 · Submitted by vztu on Jun 2

AI-generated summary

DINO-R1 incorporates reinforcement learning to enhance visual in-context reasoning capabilities in vision foundation models, achieving better performance than supervised fine-tuning across various visual prompting scenarios.

Abstract

Recent work on the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement-learning-based fine-tuning frameworks, exemplified by Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose DINO-R1, the first attempt to incentivize visual in-context reasoning in vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy designed explicitly for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL-regularization to stabilize the objectness distribution and reduce training instability. This joint optimization provides dense, expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a family of DINO-R1 models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.
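
To make the GRQO idea in the abstract more concrete, here is a minimal PyTorch sketch, assuming per-query alignment scores serve as rewards that are normalized within each image's group of queries, combined with a KL term that keeps the objectness distribution close to a frozen reference model. The tensor shapes, the reward definition, the sigmoid/softmax choices, and the loss weighting are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a GRQO-style objective: group-normalized query-level
# rewards plus KL regularization of the objectness distribution.
import torch
import torch.nn.functional as F


def grqo_style_loss(alignment_scores, objectness_logits, ref_objectness_logits,
                    kl_weight=0.1, eps=1e-6):
    """Group-relative query objective (illustrative only).

    alignment_scores:      (B, Q) per-query alignment quality, used as reward.
    objectness_logits:     (B, Q) objectness logits of the model being trained.
    ref_objectness_logits: (B, Q) objectness logits of a frozen reference model.
    """
    # Group-normalize rewards over the Q queries of each image:
    # advantage = (r - mean(r)) / std(r), computed per group.
    mean = alignment_scores.mean(dim=1, keepdim=True)
    std = alignment_scores.std(dim=1, keepdim=True)
    advantages = (alignment_scores - mean) / (std + eps)

    # Encourage high-advantage queries: weight each query's log-objectness
    # by its (detached) group-relative advantage.
    log_obj = F.logsigmoid(objectness_logits)
    reward_term = -(advantages.detach() * log_obj).mean()

    # KL-regularize the objectness distribution over queries toward the
    # reference model to curb distributional drift.
    p_log = F.log_softmax(objectness_logits, dim=1)
    q = F.softmax(ref_objectness_logits, dim=1)
    kl_term = F.kl_div(p_log, q, reduction="batchmean")

    return reward_term + kl_weight * kl_term


if __name__ == "__main__":
    B, Q = 2, 900  # e.g., a batch of 2 images with 900 DETR-style queries each
    scores = torch.rand(B, Q)
    logits = torch.randn(B, Q, requires_grad=True)
    ref_logits = torch.randn(B, Q)
    loss = grqo_style_loss(scores, logits, ref_logits)
    loss.backward()
    print(float(loss))
```

The group normalization mirrors GRPO's group-relative advantage, but applied across the queries of a single image rather than across sampled responses, which is what makes the supervision dense at the query level.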

Community

It would be good if the authors could specify which "DINO" is meant at the very beginning of the paper. Is it the feature-extractor DINO or the object-detection DINO?

Anyway, thanks for the great work!
