HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents
Abstract
HyCodePolicy integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair to enhance the robustness and efficiency of embodied agent policies.
Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents. However, most existing systems lack effective mechanisms to adaptively monitor policy execution and repair code during task completion. In this work, we introduce HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Given a natural language instruction, our system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. The program is then executed in simulation while a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer their causes. By fusing structured execution traces that capture program-level events with VLM-based perceptual feedback, HyCodePolicy diagnoses failures and repairs the program accordingly. This hybrid dual-feedback mechanism enables self-correcting program synthesis with minimal human supervision. Our results demonstrate that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies, offering a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.
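The abstract describes a closed-loop generate → execute → monitor → repair cycle. The sketch below illustrates that control flow in Python; it is not the authors' implementation, and all names (`decompose`, `generate_program`, `run_in_sim`, `vlm_check`, `repair_program`, `ExecutionTrace`) are hypothetical placeholders for the components the paper describes.

```python
# Minimal sketch of the closed-loop programming cycle described in the abstract.
# Every callable below is a hypothetical placeholder, not the paper's actual API.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExecutionTrace:
    """Structured program-level events plus VLM feedback at checkpoints."""
    events: List[str] = field(default_factory=list)        # e.g. "grasp(cube) raised GripperError"
    vlm_feedback: List[str] = field(default_factory=list)  # e.g. "gripper closed beside the cube"
    success: bool = False


def closed_loop_policy(
    instruction: str,
    decompose: Callable[[str], List[str]],
    generate_program: Callable[[List[str]], str],
    run_in_sim: Callable[[str], ExecutionTrace],
    vlm_check: Callable[[ExecutionTrace], List[str]],
    repair_program: Callable[[str, ExecutionTrace], str],
    max_iters: int = 3,
) -> str:
    """Iteratively synthesize and repair a code policy until execution succeeds."""
    subgoals = decompose(instruction)            # natural language -> subgoals
    program = generate_program(subgoals)         # initial program over geometric primitives

    for _ in range(max_iters):
        trace = run_in_sim(program)              # execute and log program-level events
        trace.vlm_feedback = vlm_check(trace)    # perceptual monitoring at checkpoints
        if trace.success:
            break
        # Fuse symbolic trace and perceptual feedback to localize the failure,
        # then regenerate only the faulty portion of the program.
        program = repair_program(program, trace)

    return program
```

In this reading, the dual feedback comes from combining the symbolic execution trace with the VLM's checkpoint observations inside `repair_program`, which is what allows the loop to self-correct without human supervision.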
Community
Robots can now write, mess up, watch themselves mess up, and fix it—all without crying for help. HyCodePolicy lets bots use vision + code to debug like pros.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation (2025)
- CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs (2025)
- Adaptive Domain Modeling with Language Models: A Multi-Agent Approach to Task Planning (2025)
- Foundation Model Driven Robotics: A Comprehensive Review (2025)
- Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Robot Task Planning (2025)
- GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System (2025)
- MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models (2025)