Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
Abstract
A lightweight, hyperparameter-free RL algorithm, VL-DAC, enables VLMs to learn generalized policies from inexpensive simulators, improving performance on real-world benchmarks without sacrificing image understanding accuracy.
Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.
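To make the environment-step-level value learning concrete, here is a minimal PyTorch sketch of how step-level credit assignment could look. It is an illustration under our own assumptions, not the authors' released code: the function names `step_level_gae` and `broadcast_to_tokens`, the GAE coefficients, and the tensor layout are hypothetical. The point it captures is that advantages are estimated once per environment step and then shared by every action token emitted at that step.

```python
import torch

def step_level_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation computed once per environment step."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    next_value = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]                      # no bootstrapping past episode end
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

def broadcast_to_tokens(step_advantages, tokens_per_step):
    """Give every action token the advantage of the environment step that produced it."""
    return torch.cat([adv.repeat(n) for adv, n in zip(step_advantages, tokens_per_step)])

# Example: a 3-step episode whose actions were decoded as 3, 2, and 4 tokens.
rewards = torch.tensor([0.0, 0.0, 1.0])
values  = torch.tensor([0.2, 0.4, 0.6])
dones   = torch.tensor([0.0, 0.0, 1.0])
token_adv = broadcast_to_tokens(step_level_gae(rewards, values, dones), [3, 2, 4])
```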
Community
This paper introduces VL-DAC (Vision-Language Decoupled Actor-Critic), a reinforcement learning algorithm designed to train vision-language models (VLMs) as interactive agents in synthetic environments. The key innovation is decoupling the learning process by applying token-wise PPO updates for actions while computing value loss only at the environment-step level, with gradients stopped at the VLM backbone. This approach eliminates the brittle hyperparameter tuning required by previous methods like RL4VLM and avoids the credit assignment problems of sequence-level methods like LOOP. We demonstrate that training a single VLM with VL-DAC in lightweight simulators (MiniWorld, ALFWorld, WebShop) produces policies that transfer effectively to real-world benchmarks such as BALROG, VSI-Bench, and VisualWebBench. Crucially, the combination of a robust, easy-to-deploy algorithm with the ability to acquire diverse skills across different environments opens a path toward environment scaling and comprehensive learning from experience.
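As a companion to the summary above, the following hedged PyTorch sketch shows one plausible rendering of the decoupled objective: a token-wise PPO clipped surrogate over action tokens plus a step-level value regression whose gradients are cut off from the VLM backbone with `detach()`. The signature, the `value_head` module, and the loss coefficients are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vl_dac_loss(new_logp, old_logp, token_adv,             # one entry per action token
                value_head, backbone_feats, step_returns,  # one entry per environment step
                clip_eps=0.2, vf_coef=0.5):
    # Actor: token-wise PPO clipped surrogate, applied to action tokens only.
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * token_adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * token_adv
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Critic: step-level value regression; detach() stops gradients at the VLM backbone.
    values = value_head(backbone_feats.detach()).squeeze(-1)
    value_loss = F.mse_loss(values, step_returns)

    return policy_loss + vf_coef * value_loss
```

Because the critic never back-propagates into the backbone in this sketch, the policy and value terms do not compete for the shared representation, which matches the paper's claim that the decoupling removes unstable weighting terms.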
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- A Simple"Try Again"Can Elicit Multi-Turn LLM Reasoning (2025)
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning (2025)
- The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs (2025)
- SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents (2025)
- VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning (2025)
- AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video (2025)
- Perception-Aware Policy Optimization for Multimodal Reasoning (2025)