arxiv:2601.06401

BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment

Published on Jan 10 · Submitted by Rongjunchen Zhang on Jan 12

Abstract

BizFinBench.v2 presents the first large-scale financial evaluation benchmark using authentic business data from Chinese and U.S. equity markets, featuring online assessment and expert-level Q&A pairs across multiple financial scenarios.

AI-generated summary

Large language models have undergone rapid evolution, emerging as a pivotal technology for intelligence in financial operations. However, existing benchmarks are often constrained by pitfalls such as reliance on simulated or general-purpose samples and a focus on singular, offline static scenarios. Consequently, they fail to align with the requirements for authenticity and real-time responsiveness in financial services, leading to a significant discrepancy between benchmark performance and actual operational efficacy. To address this, we introduce BizFinBench.v2, the first large-scale evaluation benchmark grounded in authentic business data from both Chinese and U.S. equity markets, integrating online assessment. We performed clustering analysis on authentic user queries from financial platforms, resulting in eight fundamental tasks and two online tasks across four core business scenarios, totaling 29,578 expert-level Q&A pairs. Experimental results demonstrate that ChatGPT-5 achieves a prominent 61.5% accuracy in main tasks, though a substantial gap relative to financial experts persists; in online tasks, DeepSeek-R1 outperforms all other commercial LLMs. Error analysis further identifies the specific capability deficiencies of existing models within practical financial business contexts. BizFinBench.v2 transcends the limitations of current benchmarks, achieving a business-level deconstruction of LLM financial capabilities and providing a precise basis for evaluating efficacy in the widespread deployment of LLMs within the financial domain. The data and code are available at https://github.com/HiThink-Research/BizFinBench.v2.

Community

Paper author Paper submitter

BizFinBench.v2 is the second release of BizFinBench. It is built entirely on real-world user queries from Chinese and U.S. equity markets. It bridges the gap between academic evaluation and actual financial operations.

🌟 Key Features

  • Authentic & Real-Time: 100% derived from real financial platform queries, integrating online assessment capabilities.
  • Expert-Level Difficulty: A challenging dataset of 29,578 Q&A pairs requiring professional financial reasoning.
  • No Judge Model: Utilizes rule-based metrics instead of dynamic judge models to ensure 100% reproducibility, high efficiency, and reliable scoring.
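
To illustrate why rule-based scoring is reproducible, here is a minimal sketch of an exact-match accuracy metric. It is not the benchmark's actual scorer: the function names `normalize` and `exact_match_accuracy`, the normalization rules, and the sample answers are all assumptions made for illustration. The point is only that the same inputs always produce the same score, with no judge model involved.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace (hypothetical normalization rules)."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative answers only; these are not taken from the dataset.
preds = ["A", " b ", "Buy"]
refs = ["a", "B", "sell"]
score = exact_match_accuracy(preds, refs)  # two of the three answers match
```

Because the comparison is a deterministic string rule, rerunning it on the same model outputs cannot change the result, which is the property the "No Judge Model" bullet describes.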

📊 Key Findings

  • High Difficulty: Even ChatGPT-5 achieves only 61.5% accuracy on main tasks, highlighting a significant gap vs. human experts.
  • Online Prowess: DeepSeek-R1 outperforms all other commercial LLMs in dynamic online tasks, achieving a total return of 13.46% with a maximum drawdown of -8%.
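
For readers unfamiliar with the two online-task figures, the sketch below shows the standard definitions of total return and maximum drawdown computed from a series of portfolio values. How the benchmark itself computes them is not stated here, so treat the function names and the sample equity curve as assumptions; the numbers are made up and are not the DeepSeek-R1 results.

```python
def total_return(values):
    """Relative change from the first to the last portfolio value."""
    if not values:
        raise ValueError("values must not be empty")
    return values[-1] / values[0] - 1.0


def max_drawdown(values):
    """Largest peak-to-trough decline, returned as a non-positive fraction."""
    if not values:
        raise ValueError("values must not be empty")
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)                 # highest value seen so far
        worst = min(worst, v / peak - 1.0)  # decline relative to that peak
    return worst


# Hypothetical equity curve for illustration only.
equity = [100.0, 110.0, 99.0, 120.0]
ret = total_return(equity)  # roughly 0.20, i.e. a 20% total return
mdd = max_drawdown(equity)  # roughly -0.10, i.e. a -10% maximum drawdown
```

Under these definitions, a total return of 13.46% with a maximum drawdown of -8% means the portfolio ended 13.46% above its starting value and never fell more than 8% below a previous peak along the way.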
