Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLMs
Abstract
Post-training techniques such as Supervised Fine-Tuning and Reinforcement Learning with Verifiable Rewards improve the reasoning and economic rationality of Large Language Models in multi-agent scenarios through domain-aligned training.
Directly training Large Language Models (LLMs) for Multi-Agent Systems (MAS) remains challenging due to intricate reward modeling, dynamic agent interactions, and demanding generalization requirements. This paper explores whether post-training techniques, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), can effectively generalize to multi-agent scenarios. We use economic reasoning as a testbed, leveraging its strong foundations in mathematics and game theory, its demand for structured analytical reasoning, and its relevance to real-world applications such as market design, resource allocation, and policy analysis. We introduce Recon (Reasoning like an ECONomist), a 7B-parameter open-source LLM post-trained on a hand-curated dataset of 2,100 high-quality economic reasoning problems. Comprehensive evaluation on economic reasoning benchmarks and multi-agent games reveals clear improvements in structured reasoning and economic rationality. These results underscore the promise of domain-aligned post-training for enhancing reasoning and agent alignment, shedding light on the roles of SFT and RL in shaping model behavior. Code is available at https://github.com/MasterZhou1/Recon.
Community
We study whether post-training techniques generalize effectively to multi-agent scenarios, using economic reasoning and game-theoretic evaluation as a testbed. We introduce Recon, a 7B LLM post-trained on 2,100 curated economic reasoning problems via SFT and GRPO; this domain-aligned training induces strategic behavior in multi-agent games without any explicit gameplay data. A minimal sketch of such a two-stage pipeline is shown below.
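As a concrete illustration of the two-stage recipe described above, the snippet below sketches an SFT-then-GRPO pipeline with Hugging Face TRL. It is a minimal sketch under stated assumptions: the base model name, the data files (econ_sft.jsonl, econ_rl.jsonl), the exact-match reward, and the hyperparameters are hypothetical placeholders, not the Recon training setup, which is available in the linked repository.

```python
# Minimal sketch of a Recon-style two-stage recipe (SFT, then GRPO with a
# verifiable reward) using Hugging Face TRL. The base model, data files,
# reward function, and hyperparameters below are illustrative assumptions,
# not the authors' released configuration (see the GitHub repo for that).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer, SFTConfig, SFTTrainer

BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"  # assumed 7B starting point

# Stage 1: supervised fine-tuning on curated economic-reasoning solutions.
# Assumes a JSONL file with a "text" field: problem statement plus worked solution.
sft_data = load_dataset("json", data_files="econ_sft.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=BASE_MODEL,
    train_dataset=sft_data,
    args=SFTConfig(output_dir="recon-sft"),
)
sft_trainer.train()
sft_trainer.save_model()  # checkpoint written to "recon-sft"

# Stage 2: RL with a verifiable reward (GRPO). Assumes a JSONL file with
# "prompt" and "answer" columns; GRPOTrainer forwards extra dataset columns
# to the reward function as keyword arguments.
def answer_reward(completions, answer, **kwargs):
    """Reward 1.0 if the gold answer string appears in the completion."""
    return [1.0 if str(a) in c else 0.0 for c, a in zip(completions, answer)]

rl_data = load_dataset("json", data_files="econ_rl.jsonl", split="train")
grpo_trainer = GRPOTrainer(
    model="recon-sft",            # continue from the SFT checkpoint
    reward_funcs=answer_reward,
    train_dataset=rl_data,
    args=GRPOConfig(output_dir="recon-grpo", num_generations=8),
)
grpo_trainer.train()
```

The exact-match check stands in for whatever verifier the RLVR stage actually uses; any deterministic, programmatically checkable scoring function can be dropped into reward_funcs in the same way.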
The following related papers were recommended by the Semantic Scholar API via the Librarian Bot:
- RM-R1: Reward Modeling as Reasoning (2025)
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning (2025)
- Learning to Reason without External Rewards (2025)
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning (2025)
- NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning (2025)
- Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models (2025)
- Learning to Reason under Off-Policy Guidance (2025)