Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
Abstract
A lightweight, hyperparameter-free RL algorithm, VL-DAC, enables VLMs to learn generalized policies from inexpensive simulators, improving performance on real-world benchmarks without sacrificing image understanding accuracy.
Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.
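To make the environment-step-level value learning concrete, here is a minimal PyTorch sketch of how step-level credit assignment could look. It is an illustration under our own assumptions, not the authors' released code: the function names `step_level_gae` and `broadcast_to_tokens`, the GAE coefficients, and the tensor layout are hypothetical. The point it captures is that advantages are estimated once per environment step and then shared by every action token emitted at that step.

```python
import torch

def step_level_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation computed once per environment step."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    next_value = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]                      # no bootstrapping past episode end
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

def broadcast_to_tokens(step_advantages, tokens_per_step):
    """Give every action token the advantage of the environment step that produced it."""
    return torch.cat([adv.repeat(n) for adv, n in zip(step_advantages, tokens_per_step)])

# Example: a 3-step episode whose actions were decoded as 3, 2, and 4 tokens.
rewards = torch.tensor([0.0, 0.0, 1.0])
values  = torch.tensor([0.2, 0.4, 0.6])
dones   = torch.tensor([0.0, 0.0, 1.0])
token_adv = broadcast_to_tokens(step_level_gae(rewards, values, dones), [3, 2, 4])
```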
Community
This paper introduces VL-DAC (Vision-Language Decoupled Actor-Critic), a reinforcement learning algorithm designed to train vision-language models (VLMs) as interactive agents in synthetic environments. The key innovation is decoupling the learning process by applying token-wise PPO updates for actions while computing value loss only at the environment-step level, with gradients stopped at the VLM backbone. This approach eliminates the brittle hyperparameter tuning required by previous methods like RL4VLM and avoids the credit assignment problems of sequence-level methods like LOOP. We demonstrate that training a single VLM with VL-DAC in lightweight simulators (MiniWorld, ALFWorld, WebShop) produces policies that transfer effectively to real-world benchmarks such as BALROG, VSI-Bench, and VisualWebBench. Crucially, the combination of a robust, easy-to-deploy algorithm with the ability to acquire diverse skills across different environments opens a path toward environment scaling and comprehensive learning from experience.
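As a companion to the summary above, the following hedged PyTorch sketch shows one plausible rendering of the decoupled objective: a token-wise PPO clipped surrogate over action tokens plus a step-level value regression whose gradients are cut off from the VLM backbone with `detach()`. The signature, the `value_head` module, and the loss coefficients are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vl_dac_loss(new_logp, old_logp, token_adv,             # one entry per action token
                value_head, backbone_feats, step_returns,  # one entry per environment step
                clip_eps=0.2, vf_coef=0.5):
    # Actor: token-wise PPO clipped surrogate, applied to action tokens only.
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * token_adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * token_adv
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Critic: step-level value regression; detach() stops gradients at the VLM backbone.
    values = value_head(backbone_feats.detach()).squeeze(-1)
    value_loss = F.mse_loss(values, step_returns)

    return policy_loss + vf_coef * value_loss
```

Because the critic never back-propagates into the backbone in this sketch, the policy and value terms do not compete for the shared representation, which matches the paper's claim that the decoupling removes unstable weighting terms.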
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- A Simple"Try Again"Can Elicit Multi-Turn LLM Reasoning (2025)
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning (2025)
- The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs (2025)
- SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents (2025)
- VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning (2025)
- AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video (2025)
- Perception-Aware Policy Optimization for Multimodal Reasoning (2025)