Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling
Abstract
MACT, a Multi-Agent Collaboration framework with Test-Time scaling, enhances visual document understanding and VQA by coordinating four specialized agents with mixed reward modeling and agent-wise test-time scaling, achieving superior performance with fewer parameters.
Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform on tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling, which balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes a scaling strategy for each agent according to its function. Evaluated on benchmarks spanning both document-based and non-document-based settings, MACT achieves superior performance at a smaller parameter scale without sacrificing performance on general and mathematical tasks. In particular, it stands out on benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading on 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
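To make the collaboration pattern described above concrete, the sketch below outlines one plausible control flow for the four agents: the judgment agent only verifies correctness and routes the task back to an earlier agent rather than rewriting outputs itself. This is a minimal illustrative sketch, not the authors' implementation (see the linked repository); the function names and parameters (plan_fn, execute_fn, judge_fn, answer_fn, max_revisions) are hypothetical stand-ins.

```python
# Hypothetical sketch of the four-agent MACT-style loop described in the abstract.
# All agent interfaces here are assumptions for illustration, not the paper's API.
from typing import Callable, List, Tuple

def mact_loop(
    question: str,
    document_images: List[object],
    plan_fn: Callable[[str, List[object]], str],            # planning agent: produce a step plan
    execute_fn: Callable[[str, str, List[object]], str],     # execution agent: carry out the plan
    judge_fn: Callable[[str, str, str], Tuple[bool, str]],   # judgment agent: verify correctness only
    answer_fn: Callable[[str, str], str],                    # answer agent: extract the final answer
    max_revisions: int = 3,
) -> str:
    """Judgment does not patch answers; it redirects to a prior agent for revision."""
    plan = plan_fn(question, document_images)
    trace = ""
    for _ in range(max_revisions):
        trace = execute_fn(question, plan, document_images)
        ok, feedback = judge_fn(question, plan, trace)
        if ok:
            break
        # Redirect to a prior agent (here, planning) with the reviewer feedback;
        # a fuller version could route to either the planning or execution agent.
        plan = plan_fn(question + "\n[Reviewer feedback] " + feedback, document_images)
    return answer_fn(question, trace)
```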
Community
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space (2025)
- Diversity-Enhanced Reasoning for Subjective Questions (2025)
- MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval (2025)
- Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering (2025)
- Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS (2025)
- Look-Back: Implicit Visual Re-focusing in MLLM Reasoning (2025)
- MAO-ARAG: Multi-Agent Orchestration for Adaptive Retrieval-Augmented Generation (2025)