Glenn Matlin committed on
Commit
f4ec8ea
·
1 Parent(s): 83b220f

Update website with FLaME paper content


- Updated index.html with complete information from the FLaME paper
- Updated author information, abstract, methodology, results, and contributions
- Added framework & resources section
- Updated links to GitHub, arXiv, and HuggingFace
- Added paper figures to the page
- Updated CLAUDE.md with comprehensive paper information
- Added FLaME.tex and content files

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>

This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. CLAUDE.md +85 -3
  2. FLaME.tex +42 -0
  3. FLaME/content/0_authors.tex +1 -0
  4. FLaME/content/10_contributions.tex +1 -0
  5. FLaME/content/1_abstract.tex +10 -0
  6. FLaME/content/2a_introduction.tex +36 -0
  7. FLaME/content/2b_relatedwork.tex +11 -0
  8. FLaME/content/3_methodology.tex +8 -0
  9. FLaME/content/3a_overview.tex +12 -0
  10. FLaME/content/3b_taxonomy.tex +21 -0
  11. FLaME/content/3c_datasets.tex +4 -0
  12. FLaME/content/3d_models.tex +7 -0
  13. FLaME/content/4_results.tex +94 -0
  14. FLaME/content/6_conclusion.tex +21 -0
  15. FLaME/content/7_limitations.tex +11 -0
  16. FLaME/content/8_ethics.tex +3 -0
  17. FLaME/content/9_acknowledgements.tex +8 -0
  18. FLaME/content/appendices/appendix_datasets.tex +64 -0
  19. FLaME/content/appendices/appendix_ethicslegal.tex +31 -0
  20. FLaME/content/appendices/appendix_incomplete.tex +101 -0
  21. FLaME/content/appendices/appendix_models.tex +61 -0
  22. FLaME/content/appendices/appendix_prompting.tex +14 -0
  23. FLaME/content/appendices/appendix_relatedwork.tex +12 -0
  24. FLaME/content/appendices/appendix_results.tex +163 -0
  25. FLaME/content/appendices/appendix_taxonomy.tex +85 -0
  26. FLaME/content/datasets/banking77.tex +2 -0
  27. FLaME/content/datasets/convfinqa.tex +1 -0
  28. FLaME/content/datasets/ectsum.tex +1 -0
  29. FLaME/content/datasets/edtsum.tex +1 -0
  30. FLaME/content/datasets/finbench.tex +1 -0
  31. FLaME/content/datasets/fincausal.tex +11 -0
  32. FLaME/content/datasets/finentity.tex +55 -0
  33. FLaME/content/datasets/finer.tex +1 -0
  34. FLaME/content/datasets/finqa.tex +1 -0
  35. FLaME/content/datasets/finred.tex +1 -0
  36. FLaME/content/datasets/fiqa.tex +2 -0
  37. FLaME/content/datasets/fnxl.tex +49 -0
  38. FLaME/content/datasets/fomc.tex +1 -0
  39. FLaME/content/datasets/fpb.tex +1 -0
  40. FLaME/content/datasets/headlines.tex +1 -0
  41. FLaME/content/datasets/numclaim.tex +1 -0
  42. FLaME/content/datasets/refind.tex +1 -0
  43. FLaME/content/datasets/subjectiveqa.tex +1 -0
  44. FLaME/content/datasets/tatqa.tex +1 -0
  45. FLaME/content/figures/fig_methodology_domain.pdf +0 -0
  46. FLaME/content/figures/fig_methodology_tasks.pdf +0 -0
  47. FLaME/content/figures/fig_overview_flow.pdf +0 -0
  48. FLaME/content/figures/fig_overview_tech.pdf +0 -0
  49. FLaME/content/tables/by_task/causal_analysis.tex +31 -0
  50. FLaME/content/tables/by_task/information_retrieval.tex +30 -0
CLAUDE.md CHANGED
@@ -1,8 +1,10 @@
1
- # CLAUDE.md - Guidelines for Nerfies Website
2
 
3
  ## Project Overview
4
- - Static website for Nerfies research project
 
5
  - No build system (pure HTML/CSS/JavaScript)
 
6
 
7
  ## Serving the Website
8
  - For local testing: `python -m http.server 8000` (will serve from current directory)
@@ -22,8 +24,88 @@
22
  - Image formats: Prefer .jpg for photos, .svg for vector graphics
23
  - Video formats: Use .mp4 for compatibility
24
  - Optimize media files for web delivery
 
25
 
26
  ## Structure
27
  - Keep all CSS in static/css/
28
  - Keep all JavaScript in static/js/
29
- - Keep media files in appropriate subdirectories
 
1
+ # CLAUDE.md - Project Information and Guidelines
2
 
3
  ## Project Overview
4
+ - FLaME: Holistic Financial Language Model Evaluation
5
+ - Static website built with Bulma CSS framework
6
  - No build system (pure HTML/CSS/JavaScript)
7
+ - Research paper for ACL Annual Advances in Research (Feb 2025)
8
 
9
  ## Serving the Website
10
  - For local testing: `python -m http.server 8000` (will serve from current directory)
 
24
  - Image formats: Prefer .jpg for photos, .svg for vector graphics
25
  - Video formats: Use .mp4 for compatibility
26
  - Optimize media files for web delivery
27
+ - Paper figures are in PDF format in FLaME/content/figures/
28
 
29
  ## Structure
30
  - Keep all CSS in static/css/
31
  - Keep all JavaScript in static/js/
32
+ - Keep media files in appropriate subdirectories
33
+ - Paper content in FLaME/content/
34
+
35
+ ## FLaME Research Paper Information
36
+
37
+ ### Authors
38
+ - Oopy Goopy, General Munchkin Man, L'il Jim Bob, Larry
39
+ - Affiliation: Georgia Institute of Technology
40
+
41
+ ### Paper Focus and Objective
42
+ - First comprehensive benchmarking framework for evaluating language models on financial NLP tasks
43
+ - Addresses gaps in existing evaluation methodologies for financial language models
44
+ - Provides standardized evaluation framework with open-source implementation
45
+
46
+ ### Key Components
47
+
48
+ #### Taxonomy
49
+ - Organized by three dimensions: tasks, domains, and languages
50
+ - Six core FinNLP tasks:
51
+ 1. Text classification
52
+ 2. Sentiment analysis
53
+ 3. Information retrieval
54
+ 4. Causal analysis
55
+ 5. Text summarization
56
+ 6. Question answering
57
+ - Domains categorized by data source, origination, time period, etc.
58
+ - Currently focuses on English language
59
+
60
+ #### Datasets
61
+ Selected based on:
62
+ - Financial domain relevance
63
+ - Fair usage licensing
64
+ - Annotation quality
65
+ - Task substance
66
+
67
+ Key datasets include:
68
+ - Banking: Banking77, FiQA, FinRED
69
+ - Investment: FPB, Headlines, SubjectiveQA
70
+ - Accounting: FinQA, TaT-QA, ConvFinQA
71
+ - Corporate: ECTSum, EDTSum, FinCausal
72
+ - Monetary Policy: FOMC, FNXL
73
+ - Cross-domain: FinBench, NumClaim, REFinD
74
+
75
+ #### Models Evaluated
76
+ - Proprietary closed-source: GPT-4o & o1-mini, Gemini-1.5, Claude3, Cohere Command R
77
+ - Open-weight: Llama-3, DeepSeekV3 & R-1, Qwen-2 & QwQ, Mistral, Gemma-1 & 2, Mixtral, WizardLM2, DBRX
78
+ - Used deterministic decoding (temperature 0.0, top p of 0.9, repetition penalty of 1)
79
+
80
+ #### Evaluation Process
81
+ - Two-stage approach: generation and extraction
82
+ - Task-specific metrics: accuracy, F1 scores, precision, recall, BLEU scores
83
+ - Standardized zero-shot evaluation
84
+
85
+ ### Key Findings
86
+ - No single model performs best across all tasks
87
+ - Performance varies significantly based on domain and task structure
88
+ - Open-weight models show strong cost/performance efficiency
89
+ - Numeric reasoning tasks remain challenging for all models
90
+ - Inconsistent scaling: larger parameter sizes don't guarantee higher performance
91
+ - Models struggle with consistent numeric formats and longer label sets
92
+ - Top performers: DeepSeek R1, OpenAI o1-mini, Claude 3.5 Sonnet
93
+
94
+ ### Limitations
95
+ - Limited dataset size and diversity
96
+ - Focus on zero-shot scenarios only
97
+ - English-language focus
98
+ - No evaluation of advanced prompting techniques
99
+ - Doesn't capture full breadth of real-world financial scenarios
100
+
101
+ ### Future Directions
102
+ - More advanced prompt engineering
103
+ - Domain-adaptive training for numeric/causal tasks
104
+ - Benchmarking efficiency trade-offs
105
+ - Multi-lingual coverage expansion
106
+
107
+ ### Resources
108
+ - Paper PDF: FLaME/FLaME__ACL_AAR_Feb_2025_.pdf
109
+ - ArXiv: https://arxiv.org/abs/2402.14017
110
+ - GitHub: https://github.com/flame-benchmark/flame
111
+ - HuggingFace: https://huggingface.co/spaces/flame-benchmark/flame
FLaME.tex ADDED
@@ -0,0 +1,42 @@
1
+ \pdfoutput=1
2
+ \documentclass[11pt]{article}
3
+ \usepackage[final]{acl}
4
+ \usepackage{times}
5
+ \usepackage{latexsym}
6
+ \usepackage[T1]{fontenc}
7
+ \usepackage[utf8]{inputenc}
8
+ \usepackage{microtype}
9
+ \usepackage{inconsolata}
10
+ \usepackage{tabularx}
11
+ \usepackage[table, dvipsnames]{xcolor}
12
+ \input{macros}
13
+ \title{Holistic Finance Language Model Evaluation}
14
+ \input{content/0_authors}
15
+ \begin{document}
16
+ \maketitle
17
+ \input{content/1_abstract}
18
+ \input{content/2a_introduction}
19
+ \input{content/2b_relatedwork}
20
+ \input{content/3_methodology}
21
+ \input{content/3a_overview}
22
+ \input{content/3b_taxonomy}
23
+ \input{content/3c_datasets}
24
+ \input{content/3d_models}
25
+ \input{content/4_results}
26
+ \input{content/6_conclusion}
27
+ \input{content/7_limitations}
28
+ \input{content/8_ethics}
29
+ % \input{content/9_acknowledgements}
30
+ % \input{content/10_contributions}
31
+ \bibliographystyle{acl_natbib}
32
+ \bibliography{paperpile}
33
+ \appendix
34
+ \input{content/appendices/appendix_taxonomy}
35
+ \input{content/appendices/appendix_datasets}
36
+ \input{content/appendices/appendix_models}
37
+ \input{content/appendices/appendix_prompting}
38
+ \input{content/appendices/appendix_results}
39
+ \input{content/appendices/appendix_relatedwork}
40
+ \input{content/appendices/appendix_incomplete}
41
+ \input{content/appendices/appendix_ethicslegal}
42
+ \end{document}
FLaME/content/0_authors.tex ADDED
@@ -0,0 +1 @@
1
+ \author{Oopy Goopy, {\bf General Munchkin Man}, {\bf L'il Jim Bob}, {\bf Larry} \\ Georgia Institute of Technology}
FLaME/content/10_contributions.tex ADDED
@@ -0,0 +1 @@
1
+ % \section*{Contributions} \label{sec:contributions}
FLaME/content/1_abstract.tex ADDED
@@ -0,0 +1,10 @@
1
+ \begin{abstract}
2
+ Language Models (LMs) have demonstrated impressive capabilities on core Natural Language Processing (NLP) tasks. The effectiveness of LMs for highly specialized, knowledge-intensive tasks in finance remains difficult to assess due to major gaps in the methodologies of existing evaluation frameworks, which have led to an erroneously low estimate of LMs' performance on common Finance NLP (FinNLP) tasks. To demonstrate the potential of LMs for these FinNLP tasks, we present the first \textbf{\textit{Holistic}} benchmarking suite for \textbf{\textit{Financial Language Model Evaluation}} (\papertitle). Ours is the first study to comprehensively compare standard LMs against `reasoning-reinforced' LMs, with an empirical study of \nummodels foundation LMs over \numtasks core NLP tasks in finance. We open-source our framework software along with all data and results.
3
+ \end{abstract}
4
+ \\
5
+ \begin{figure}[!b]
6
+ \centering
7
+ \includegraphics[width=1\linewidth]{content/figures/fig_overview_tech.pdf}
8
+ \caption{\textbf{Technical Overview of \papertitle.}}
9
+ \label{fig:overview_tech}
10
+ \end{figure}
FLaME/content/2a_introduction.tex ADDED
@@ -0,0 +1,36 @@
1
+ \section{Introduction}\label{sec:introduction}
2
+ \input{content/tables/table__us_vs_them}
3
+ \begin{figure*}
4
+ \centering
5
+ \includegraphics[width=1\linewidth]{content/figures/fig_overview_flow.pdf}
6
+ \caption{\textbf{Functional Overview of \papertitle.}}
7
+ \label{fig:overview}
8
+ \end{figure*}
9
+ Benchmarks and datasets are the foundation for Artificial Intelligence (AI) research. How the research community collectively defines \textit{'success'} directly sets the priorities and goals of individual researchers for their investigations \cite{Raji2021-kj}. Defining and implementing benchmarks is how the wider research community understands the progress of AI development \cite{Birhane2022-az}. Recent developments enabling the general commercial availability of foundation Language Models (LMs) \cite{Bommasani2021-ro, Zhao2023-mn} (\eg, ChatGPT \cite{Brown2020-zf}, Claude \cite{AnthropicUnknown-vl}, Gemini \cite{Gemini-Team2023-gv}, LLaMa \cite{Touvron2023-ia} etc.) means there is now widespread interest in tracking the progress of LM systems \cite{Chang2023-wf,Nie2024-mh}. The widespread availability of LMs has enabled an explosion of research on AI capabilities for many knowledge-intensive and highly-specialized domains, such as medicine, law, and finance \cite{Guha2024-ox,Chen2024-zb,Kaddour2023-sg}. Prior research has raised serious concerns about the ability of LMs to generalize their reasoning or adapt to specialized domains \citep{Bender2021-wf, Kocon2023-py}, particularly finance \citealp{Kang2023-hn, Zhao2024-eh, Dong2024-xd, Chen2024-zb}.
10
+ Despite this explosion of interest and skepticism, there has not yet been a sufficiently rigorous and \textbf{\textit{holistic evaluation}} of the performance of foundation LMs for core NLP tasks in finance. Existing state-of-the-art efforts lack sufficient standardization and rigor to identify the true performance bounds of foundation LMs. Poor understanding of these errors leads to real-world failures in financial computing systems.
11
+ % Computing errors in finance have long had significant real-world consequences, even before the advent of AI.
12
+ The risk of failures in AI-enabled financial systems should be a primary concern for both academia and industry. Without a deep understanding of common failures in LM-enabled finance NLP tasks (e.g., generating incorrect financial data), these systems may mislead users, leading to substantial harm. Misinformation stemming from analytical failures, flawed reasoning, or outright hallucinations remains a persistent challenge and may be difficult, if not impossible, to fully eliminate \citep{Ye2023-qn, Li2023-oo}.
13
+ %\cite{Xu2024-cm, Ji2022-fk}. % not working?
14
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
15
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
16
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
17
+ Over the past few years, multiple benchmark evaluation suites have emerged to assess model performance on finance-oriented NLP tasks. However, these efforts typically:
18
+ \begin{enumerate}
19
+ \item Serve as a collection of benchmarks without establishing an in-depth taxonomy,
20
+ \item Neglect to standardize criteria for data selection or evaluation,
21
+ \item Lack a systematic recognition of incompleteness of current methods, and
22
+ \item Narrow their evaluation scope to only fine-tuned or closed-source LMs.
23
+ \end{enumerate}
24
+ \textbf{\textit{Holistic evaluations}} are critical for AI in finance: system failures caused by an insufficient understanding of the weaknesses and errors of LMs on core NLP tasks in finance will cause serious harm to the general public, along with economic and legal consequences for businesses and financial institutions. We adopt the widely accepted meaning of \textit{holistic evaluation} from \citet{Liang2022-ew}, which defines three requirements for a holistic evaluation: \textbf{(1) standardization}, \textbf{(2) recognition of incompleteness}, and \textbf{(3) multi-metric evaluation}. \textbf{\textit{Holistic benchmark suites}} help prevent these errors by identifying gaps in data coverage in their dataset taxonomy, encouraging comprehensive study of model behavior, and providing a reliable and repeatable method for comparison.
25
+ % Holistic evaluations give the research community a deeper understanding of the robustness and reliability of AI and identify gaps in current research.
26
+ However, no benchmark suites for evaluating core NLP finance tasks on LMs meet the definition of 'holistic.' In \reftab{us-vs-them}, we assess other existing benchmarks and highlight how they fail to meet the criteria for a holistic evaluation.
27
+ % Without an established holistic evaluation, there is a significant risk of harm caused by an insufficient understanding of the weaknesses and errors of LMs on core NLP tasks in finance.
28
+ To solve this critical gap for our community, we propose \papertitle, which provides the following novel contributions:
29
+ \begin{enumerate}
30
+ \item \textbf{Standardized Evaluation Framework}: We release \textbf{open-source software} for creating standardized pipelines for LM evaluation on core financial NLP tasks. Our customizable pipeline (see \reffig{overview}) manages the full evaluation workflow, from dataset loading and prompting to inference and metric computation.
31
+ \item \textbf{Large-Scale Model Assessment}: We conduct \textbf{extensive evaluations} of \nummodels open-weight and proprietary LMs, exposing strengths and weaknesses across \numtasks financial benchmarks (see \reffig{overview_tech}). We provide a \textbf{meta-analysis} of the results, including a \textit{study on the performance/cost trade-off space}. Our in-depth \textbf{error analysis} offers more insight into recurring model failures.
32
+ \item \textbf{Living Benchmark}: We provide a \textbf{public leaderboard} to encourage continuous updates. Researchers and practitioners can contribute new datasets or model results, extending \papertitle beyond our initial contributions. By design, this effort \textit{\textbf{explicitly}} welcomes peer review and invites ongoing collaboration.
33
+ \item \textbf{Taxonomy and Dataset Selection}: We present a holistic taxonomy for financial NLP tasks, detailing the financial domain scenario and categorizing benchmarking tasks. We also establish \textbf{clear inclusion criteria} (domain relevance, licensing, label quality).
34
+ % and identify underrepresented areas requiring additional data contributions from the community.
35
+ \end{enumerate}
36
+ % The remainder of this paper is structured as follows. First, we compare \refsec{relatedwork} and draw direct and tangible comparisons against our approach. Next, we outline the \papertitle \refsec{methodology} \emdash its novel taxonomy, stringent dataset criteria, and automated evaluation pipeline. We present our experimental \refsec{results}, including quantitative metrics and qualitative analyses of error patterns. We end the study with our \refsec{conclusions}, providing a meta-analysis of results and a summary of our contributions. Finally, we discuss the \refsec{limitations} of this work associated with our initial dataset scope (\eg, spatial, cultural, and temporal bias) and our cost constraints for data collection.
FLaME/content/2b_relatedwork.tex ADDED
@@ -0,0 +1,11 @@
1
+ \section{Related Work}\label{sec:relatedwork}
2
+ \subsection{Foundation Language Models}
3
+ The past few years have seen remarkable progress in LMs, driving state-of-the-art performance across a broad range of core NLP tasks in the financial domain. LMs exhibit strong performance on both general-domain benchmarks and increasingly complex tasks (\eg, multi-hop reasoning, tool use, multi-modal).
4
+ % \citeTODO.
5
+ The term ``large language model'' has increased rapidly in use; however, its definition is broad enough to encompass fine-tuned models or systems. We define language models as probabilistic models for natural language, and foundation models as those trained on broad datasets (typically using large-scale self-supervision) that can be adapted (e.g., fine-tuned) for a wide range of downstream tasks \cite{Bommasani2021-ro}. Our study aims for a robust and holistic understanding of LM performance rather than use-case-specific adaptations. We prioritize studying foundation LMs, as all fine-tuned models originate from a foundation model. The performance of fine-tuned models heavily depends on the pre-training stage (\ie, self-supervised learning) of the foundation model \cite{Chia2023-sb}.
6
+ \subsection{Language Model Evaluation}
7
+ \textit{\textbf{Domain-specific}} evaluations for knowledge-intensive fields (\eg, medicine, law, computing) have seen much research interest \cite{Guha2024-ox,Chen2024-zb,Kaddour2023-sg}. However, the amount of research dedicated to \textbf{\textit{finance-specific}} evaluations is relatively under-studied. Without further evaluation, LMs used in financial systems may lead to incorrect predictions, misinterpretations of regulatory text, flawed market analysis, and other significant financial risks.\\
8
+ A robust body of research has focused on developing benchmarks to measure the evolving capabilities of LMs in broad NLP contexts. Landmark resources such as GLUE \citep{Wang2018-qm}, SuperGLUE \citep{Wang2019-jb}, SQuAD \cite{Rajpurkar2016-si}, HellaSwag \cite{Zellers2019-lp}, and others have helped standardize the evaluation of general natural language understanding for AI. Subsequent benchmarking efforts including MMLU \citep{Hendrycks2020-rz, Wang2024-pk}, Dynabench \cite{Kiela2021-pi, Ma2021-zb}, BigBench \citep{Srivastava2022-tn, Suzgun2022-pp}, the AI2 Reasoning Challenge (ARC) \citep{Clark2018-yg}, and many others have introduced more challenging domains, spanning multi-step reasoning, commonsense tasks, and even agent interactions.
9
+ Of particular relevance is \textbf{Holistic Evaluation of Language Models (HELM)} \citepHELM, which advocates an approach with \textbf{\textit{three core requirements}}: (1) standardized evaluation methods, (2) multi-metric assessments, and (3) an explicit recognition of benchmark incompleteness. While these benchmarks have significantly helped with research on general LM capabilities, they do not explicitly address the intricacies of finance-specific applications, such as handling financial definitions, regulatory language, and domain-specific reasoning.
10
+ \subsection{Financial Task Benchmarks}
11
+ Datasets and benchmarks serve a foundational role in the evaluation of AI systems for finance. Although researchers have investigated LMs for finance \cite{Wu2023-ph}, how to rigorously evaluate such models remains an open challenge. In this work, we build on general and domain-specific insights to propose a more holistic evaluation framework for LMs specifically for finance. Researchers have begun adapting or creating benchmark suites tailored to finance, \textbf{yet no existing work meets the criteria for holistic evaluation} of financial scenarios. In \reftab{us-vs-them}, we assess these existing benchmarks and highlight how they fail to meet the criteria for a holistic evaluation. We provide a full discussion and comparison of \papertitle with prior works in \refapp{relatedwork}.
FLaME/content/3_methodology.tex ADDED
@@ -0,0 +1,8 @@
1
+ \section{Methodology}\label{sec:methodology}
2
+ We present our methodology for holistic financial language model evaluation. \papertitle is the first \textit{\textbf{holistic}} benchmark suite for core NLP tasks in finance. Using this proposed methodology enables researchers to focus on evaluating the \textit{fundamental} abilities of \textit{foundation} models.
3
+
4
+ % The structure of the methodology section is as follows: First, we discuss how the software framework of \refsec{holiflame} implements strict controls on data splits, prompt formatting, and inference parameters, enabling reproducible and fair comparisons of diverse LMs. Researchers can rely on \textit{standardized} evaluations to control for as many aspects of LM as possible and make meaningful comparisons. Then we discuss our \refsec{taxonomy} for finance-specific core NLP tasks. By clearly structuring \refsec{datasets} by tasks, domains, and languages, our taxonomy highlights the underrepresented areas of financial NLP, guiding future data-collection efforts. Finally, we describe our criteria for \refsec{models} selection and detail the sampling strategy for token generation.
5
+
6
+ % Our \textit{\textbf{scenario}} based taxonomy specifies \textit{what} a model should do (the task), the \textit{context} in which the text arises (the domain), and the \textit{language} or medium. For our financial setting, we define each scenario as a specific finance-oriented NLP task (\eg, question answering, sentiment analysis) based on a full-definition of the domain (\eg, earnings-call transcripts from 2022, regulatory filings from U.S. institutions).
7
+ % Our methodology helps the community identify potential risks or strengths that could remain hidden by focusing on a single domain or task.
8
+ % ... we demonstrate how our holistic evaluation uses a standardized method for evaluating each model across \textbf{\textit{multiple metrics}} and analyze the trade-offs and weaknesses of foundation LMs (\refsec{metrics}).
FLaME/content/3a_overview.tex ADDED
@@ -0,0 +1,12 @@
1
+ \subsection{\papertitle}\label{sec:holiflame}
2
+ We conducted quality checks (license validation, label audits) to ensure each dataset meets the \textbf{inclusion criteria} described in \refapp{datasets}. Full credit and acknowledgment are given to the authors of these benchmarks. We provide all the pre-processing code used for these datasets and direct readers to their original hosting sources. We encourage all readers to refer to our extensive discussion in \refapp{ethicslegal} on the ethics and legal matters regarding appropriate use by others. To promote collaboration and transparent reporting, \papertitle provides a public leaderboard (Note: All details are withheld here for double-blind review).\\
3
+ % \para{\papertitle Pipeline.}
4
+ The evaluation pipeline proceeds in stages:
5
+ \begin{enumerate}
6
+ \item \textbf{Configuration:} Users select desired tasks, datasets, and model parameters.
7
+ % Meta-scoring can also be enable to emphasize certain tasks for each use-case.
8
+ \item \textbf{Model Interaction:} The system queries each LM \dash via local instantiation or a remote API \dash to collect its outputs. We automatically handle token limits, rate-limiting, and retry logic for cloud services.
9
+ \item \textbf{Post-processing and Extraction:} Generated text undergoes parsing, ensuring any structured output is normalized.
10
+ \item \textbf{Metric Computation:} User-specified metrics are computed. All parameters (prompt, settings) are logged.
11
+ \end{enumerate}
12
+ This modular design \emph{decomposes} complex tasks, allowing researchers to customize each step \dash e.g., hooking in novel prompt engineering, or adding new metrics. By default, \papertitle \emph{checkpoints} each step to guarantee reproducibility and traceability of results. We anticipate the community will extend or refine these modules as financial NLP evolves.
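To make the staged design above concrete, here is a minimal sketch of such a pipeline in Python. The helper names (run_model, extract_answer, compute_metrics) and the JSON checkpoint layout are illustrative assumptions, not the actual \papertitle package API.

```python
# Minimal sketch of the four-stage pipeline described above.
# run_model, extract_answer, and compute_metrics are hypothetical placeholders.
import json
from pathlib import Path

def checkpoint(stage: str, payload, out_dir: Path):
    """Persist each stage's output so runs stay reproducible and traceable."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{stage}.json").write_text(json.dumps(payload, indent=2))

def evaluate(config: dict, run_model, extract_answer, compute_metrics):
    out_dir = Path(config["output_dir"])
    # 1. Configuration: tasks, datasets, and model parameters come from `config`.
    checkpoint("configuration", config, out_dir)
    # 2. Model interaction: query the LM for every example.
    generations = [run_model(config["model"], ex["prompt"]) for ex in config["examples"]]
    checkpoint("generations", generations, out_dir)
    # 3. Post-processing and extraction: normalize structured output.
    predictions = [extract_answer(g) for g in generations]
    checkpoint("predictions", predictions, out_dir)
    # 4. Metric computation: user-specified metrics over gold labels.
    scores = compute_metrics(predictions, [ex["label"] for ex in config["examples"]])
    checkpoint("metrics", scores, out_dir)
    return scores
```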
FLaME/content/3b_taxonomy.tex ADDED
@@ -0,0 +1,21 @@
1
+ \subsection{Taxonomy}\label{sec:taxonomy}
2
+ Previous benchmark suites act largely as collections of datasets with nonstandard definitions of task categories. \papertitle instead uses a \textit{\textbf{scenario}}-based taxonomy. Our taxonomy improves on prior works by defining the complex scenario space of FinNLP tasks. Unlike prior works, the \papertitle taxonomy categorizes financial data based on their primary characteristics and attributes. We define our taxonomy based on these characteristics to avoid creating superfluous categories that unnecessarily add complexity by diverging from established NLP terminology. Our taxonomy is intentionally designed to rely on broad categories (with subcategories as appropriate) to maintain a balance between simplicity and granularity.
3
+ % The usefulness of a taxonomy is related to its ability to inform researchers and guide further studies.
4
+ By mapping out the complex space of different financial scenarios, our taxonomy highlights the current paucity of data and the need for more research on financial LM benchmarks. The \papertitle website allows users to browse all available datasets and results using our taxonomy. Our work allows researchers to perform deep analysis of the availability and quality of benchmarking datasets. We emphasize the idea that every possible financial \textit{scenario} (i.e., what the LM should do) can be represented with a combination of three attributes: \textit{tasks}, \textit{domains}, and \textit{languages}.
5
+ \begin{figure}
6
+ \centering
7
+ \includegraphics[width=1\linewidth]{content//figures/fig_methodology_tasks.pdf}
8
+ \caption{\textbf{Illustrative breakdown for each of the six core NLP task categories.} While our taxonomy groups these tasks broadly, each category can encompass numerous specialized variants depending on data format, user needs, and domain constraints. We provide a limited set of specific examples to illustrate the concepts.}
9
+ \label{fig:methodology_tasks}
10
+ \end{figure}
11
+ \begin{figure*}
12
+ \centering
13
+ \includegraphics[width=1\linewidth]{content//figures/fig_methodology_domain.pdf}
14
+ \caption{\textbf{Holistic Taxonomy for \papertitle.} Previous FinNLP benchmark suites functioned as collections of datasets tied to a specific task, often using only a single metric to report performance. By comparison, \papertitle takes a \textit{holistic} approach of enumerating the full space of tasks, scenarios, and metrics. We consider each benchmark and taxonomize it across multiple dimensions for a complete analysis.
15
+ % With our final taxonomy we then can explicitly describe the decision making process around the implementation and evaluation of any benchmark. Our taxonomy also helps researchers identify what kinds of coverage this paper and others lack (\eg, multi-lingual, multi-modal, etc.)
16
+ }
17
+ \label{fig:methodology_domain}
18
+ \end{figure*}
19
+ \paragraph{Tasks.} In \papertitle, we consider six core FinNLP tasks (see \reffig{methodology_tasks}), each selected for their relevance and importance for FinNLP. These tasks reflect real-world financial applications such as document retrieval, risk classification, and automated financial analysis. The categories are designed to be broad enough to capture most FinNLP applications while remaining specific enough to support rigorous evaluation.
20
+ \paragraph{Domains.} Each dataset is classified by its domain, which considers what the data represents, who produced it, where it originates, when it was generated, how it was created, and why it exists. Domains include financial institutions, regulators, news media, small businesses, and individual investors. \citet{Liang2022-ew} organizes \textit{\textbf{domains}} primarily by the ``3 W's,'' describing what (genre of text), when (time period), and who (demographic or author source). We expand on this definition for finance by detailing additional attributes such as ``where'' for origin (\eg, \textit{specific} regulatory bodies or financial institutions) and ``how'' for data types (\eg, \textit{transcribed} earnings calls, \textit{human-annotated} SEC filings). This refinement ensures we capture the domain complexity unique to financial text sources.
21
+ \paragraph{Languages.} Our taxonomy currently focuses on English-language financial datasets but acknowledges the need for multilingual FinNLP resources, particularly for global markets.
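As a concrete illustration of the scenario triple defined above, the sketch below represents each dataset as a (task, domain, language) record and filters by task. The registry entries and field values are illustrative assumptions, not the official \papertitle taxonomy tables.

```python
# Minimal sketch of the (task, domain, language) scenario triple described above.
# Dataset names and domain labels are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    task: str      # e.g., "question answering", "sentiment analysis"
    domain: str    # e.g., "earnings calls", "financial news"
    language: str  # currently "en" for the first iteration

REGISTRY = {
    "convfinqa": Scenario("question answering", "financial reports", "en"),
    "fpb": Scenario("sentiment analysis", "financial news", "en"),
    "ectsum": Scenario("text summarization", "earnings calls", "en"),
}

def datasets_for(task: str) -> list[str]:
    """Return registered datasets whose scenario matches a task category."""
    return [name for name, s in REGISTRY.items() if s.task == task]

print(datasets_for("sentiment analysis"))  # ['fpb']
```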
FLaME/content/3c_datasets.tex ADDED
@@ -0,0 +1,4 @@
1
+ \subsection{Datasets}\label{sec:datasets}
2
+ We construct \papertitle’s dataset suite according to explicit selection criteria that ensure \textbf{financial domain relevance}, \textbf{fair usage licensing}, \textbf{annotation quality}, and \textbf{task substance}. Datasets must focus primarily on \textit{financial} text rather than tangential business or economic references. We exclude datasets that are not made public to researchers, do not have research-friendly licensing, or that do not explicitly credit original data authors. While \papertitle primarily covers \textbf{\textit{core}} NLP tasks (\reffig{methodology_tasks}), certain \textbf{\textit{frontier scenarios}} (e.g., decision-making, tool-use, market forecasting) lie outside this initial scope. These tasks require deeper domain knowledge, additional metrics, and robust guardrails. We aim to incorporate them in future expansions. After applying the above criteria, we selected \numtasks datasets for \papertitle.
3
+ % \reftab{taxonomy_data}
4
+ Table 10 provides a complete list of each dataset, along with domain type, annotation method, and usage license. We perform \textbf{quality assurance} on each dataset for label consistency, domain specificity, and minimal data leakage. When previous studies or the community flag serious issues (e.g., skewed entity labeling, incomplete coverage), we either exclude the dataset or advise caution. For instance, prior work identified that some “CRA NER” corpora have oversimplified entity types, potentially distorting real-world distribution. We exclude such datasets or relegate them to an \emph{experimental} status if they do not meet our threshold for reliability. We also exclude benchmarks that attempt purely numeric or time-series forecasting with no natural language component, as these do not align with our focus on core NLP tasks. Please see \refapp{datasets} for full details on data selection criteria, along with additional discussion on data leakage, recommended salted hashes, and excluded datasets.
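Since the paragraph above recommends salted hashes for contamination checks, here is a minimal sketch of that idea; the salt value and helper names are hypothetical and not part of the released tooling.

```python
# Minimal sketch of a salted-hash contamination check, as recommended above.
# The salt and helper names are illustrative, not the FLaME tooling.
import hashlib

SALT = "flame-2025-example-salt"  # kept private by the dataset maintainers

def salted_fingerprint(example_text: str, salt: str = SALT) -> str:
    """Hash a held-out example with a private salt so the raw text never needs
    to be published, yet membership can still be verified later."""
    return hashlib.sha256((salt + example_text).encode("utf-8")).hexdigest()

def appears_in_corpus(example_text: str, corpus_fingerprints: set[str]) -> bool:
    """Check whether an example's fingerprint shows up in a pre-training corpus index."""
    return salted_fingerprint(example_text) in corpus_fingerprints
```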
FLaME/content/3d_models.tex ADDED
@@ -0,0 +1,7 @@
1
+ \subsection{Evaluation}\label{sec:models}
2
+ \para{Models:} We select models that are not multi-modal to focus our study on their NLP capabilities. Multi-modal models are an aspect of frontier research that deserves a separate dedicated research study (see \refapp{incomplete} for details).\\
3
+ We study the following LM families with \papertitle: proprietary closed-source systems (GPT-4o \& o1-mini, Gemini-1.5, Claude 3, and Cohere Command R), along with open-weight models including Llama-3, DeepSeekV3 \& R-1, Qwen-2 \& QwQ, Mistral, Gemma-1 \& 2, Mixtral, WizardLM2, and DBRX. All experiments involving large language models (LMs) were conducted using cloud-based APIs; we utilized commercial API access for models such as OpenAI’s GPT, Google’s Gemini, and Anthropic’s Claude.\\
4
+ We include known details on foundation LMs, such as architecture, training data, and model parameters, in \reftab{model-detail}. Results from open-source or open-weight models can be considered a truer measure of LM performance, since closed-source models or systems lack reproducibility and transparency and may reflect broader system-level adaptations rather than the underlying LM.\\
5
+ \para{Extraction:} During evaluation, the primary language model generates responses to task-specific inputs. These responses undergo a structured extraction process using a separate language model to identify relevant output elements. This two-stage approach separates the generation and extraction steps, enabling robust evaluation across different response formats. The extraction phase employs rule-based pattern matching and regular expressions to identify specific elements within model outputs. This systematic approach ensures consistent response parsing across different tasks and model architectures. The framework maintains separate evaluation criteria for financial classification, numerical reasoning, and text generation tasks.\\
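A minimal sketch of the rule-based extraction step described above, assuming a simple sentiment-style label set; the regex and label list are illustrative, not \papertitle's exact extraction rules.

```python
# Minimal sketch of rule-based answer extraction from free-form LM output.
# The label set and regex are illustrative, not FLaME's exact extraction rules.
import re

LABELS = ["positive", "negative", "neutral"]

def extract_label(generation: str) -> str | None:
    """Return the first allowed label mentioned in the model's response."""
    match = re.search(r"\b(" + "|".join(LABELS) + r")\b", generation, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None

print(extract_label("Final answer: The tone of the filing is Negative."))  # negative
```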
6
+ \para{Evaluation:} Performance measurement occurs through task-specific metrics, including accuracy, F1 scores, precision, recall, and BLEU scores for generation tasks. These metrics are computed using standardized implementations to ensure consistency across evaluations. \papertitle aggregates results by grouping scores according to task categories and financial domains. A configurable weighting system allows adjustment of score importance based on task difficulty and domain relevance. The final meta-score computation accounts for the relative performance range of models across tasks, providing a balanced assessment of financial language understanding capabilities.\\
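A minimal sketch of the configurable weighting described above (omitting the relative-range normalization); the task names, scores, and weights are placeholders rather than the configuration shipped with \papertitle.

```python
# Minimal sketch of weighted meta-score aggregation across task categories.
# Task names, scores, and weights below are placeholders.
def meta_score(task_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-task scores; unspecified tasks default to weight 1.0."""
    total_weight = sum(weights.get(t, 1.0) for t in task_scores)
    weighted = sum(score * weights.get(t, 1.0) for t, score in task_scores.items())
    return weighted / total_weight

scores = {"sentiment analysis": 0.81, "question answering": 0.42, "summarization": 0.78}
weights = {"question answering": 2.0}  # emphasize the harder numeric-reasoning task
print(round(meta_score(scores, weights), 3))
```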
7
+ \para{Generation:} Decoding strategies determine how the LM generates text tokens \cite{Wiher2022-pe}. Each decoding strategy has different settings for temperature, top p, and repetition penalty, which influence the randomness and diversity of the output token sequence. Our \textit{`deterministic'} strategy uses a temperature of 0.0, a top p of 0.9, and a repetition penalty of 1. We choose the deterministic decoding strategy to gather the most predictable and consistent results across samples, because benchmarking emphasizes accuracy and reliability. Deterministic decoding is most important for tasks such as data extraction or structured text generation, which are common in finance and benefit from the improved performance of low-temperature decoding \cite{Liang2024-cz,Zarriess2021-ty}.
FLaME/content/4_results.tex ADDED
@@ -0,0 +1,94 @@
1
+ \begin{table*}[h]
2
+ \centering
3
+ \resizebox{\textwidth}{!}{%
4
+ \input{content/tables/main_table}
5
+ }
6
+ \caption{\textbf{Overview of \papertitle Results.} This table compares results across all datasets and all models in \papertitle. We note reasoning-reinforced models as \textbf{bold text} and mixture of expert models with \emph{italics}. For full dataset details, see \refapp{datasets}. * indicates the dataset belongs in both IR and SA.}
7
+ \label{tab:main_table}
8
+ \end{table*}
9
+ \section{Experiments and Results}\label{sec:results}
10
+ In this section, we present the results of our holistic evaluation of LMs across a variety of core NLP tasks for finance, focusing on multiple dimensions: \emph{performance} and \emph{efficiency} in terms of inference overhead and cost.
11
+ We evaluated \nummodels{} language models (LMs) on the \papertitle{} benchmark suite.
12
+ \reftab{main_table} provides a high-level scoreboard across the six main task categories.\footnote{Datasets are introduced in \refsec{datasets}.}
13
+ We also detail each dataset's unique domain requirements, the metrics used, and final model performances in separate tables (see \refapp{results}).
14
+ Overall, the results reveal three key insights:
15
+ \begin{enumerate}
16
+ \item No single LM performs the best across all tasks, but a handful of models show strong overall performance.
17
+ \item Performance depends heavily on the domain and task structure, \ie numeric reasoning vs entity classification.
18
+ \item Open-weight and mid-scale models shine in cost/performance efficiency, highlighting the importance of further scientific research.
19
+ \end{enumerate}
20
+ We organize the following subsections around a \textit{meta-analysis} of the results. For \textit{model-specific} observations or \textit{per-task} discussion, please refer to \refapp{task_results}.
21
+ \subsection{Meta-Analysis of Results}
22
+ \label{sec:meta-analysis}
23
+ \para{Key Takeaways.}
24
+ \reftab{main_table} shows that certain LMs consistently perform well on multiple tasks\emdash~\eg \textbf{DeepSeek R1} leads in many IR tasks and advanced QA settings, \textbf{Claude 3.5 Sonnet} excels in sentiment (\textsc{FPB}) and some IR tasks (\textsc{FinRED}), and \textbf{GPT-4o} hovers near the top in classification and summarization.
25
+ Nevertheless, there is \emph{no single model that wins overall}: while \textbf{DeepSeek R1} dominates multi-step QA (e.g., \textsc{ConvFinQA}, \textsc{TATQA}), it trails in summarization. Performance can vary even between similar tasks, as \textbf{Claude 3.5 Sonnet} leads \textsc{FinQA}, but not necessarily multi-turn \textsc{ConvFinQA}.\\
26
+ \para{Domain-Specific Challenges.}
27
+ Numeric reasoning tasks (like \textsc{FNXL} for numeric labeling or \textsc{ConvFinQA} for multi-step financial statements) remain especially challenging, with F1 scores for \textsc{FNXL} often below 0.06, signaling that even large models struggle to precisely map an extremely large number of categories to numeric content. The relatively low scores on \textsc{ConvFinQA} compared to basic classification or retrieval tasks like \textsc{REFinD} and \textsc{Headlines} suggest that LMs suffer from sharp performance drops on tasks requiring step-by-step deductions, calculations, or cross-referencing, which could impede their application to financial forecasting and decision-making.\\
28
+ By contrast, summarization tasks yield relatively high BERTScores (0.75\dash0.82 for most models), indicating that summarization in financial contexts\emdash though non-trivial\emdash seems more tractable or amenable to the generic capabilities of foundation LMs. This could be due to those tasks only requiring LMs to identify and output the key parts of the input task, rather than having to generate text or reason through a problem.\\
29
+ \para{Inconsistent Scaling.}
30
+ Our results corroborate that \emph{larger parameter counts do not strictly guarantee higher performance}:
31
+ For instance, \textsc{Jamba 1.5 Mini} outperforms many bigger models in \textsc{FinBench}, and \textsc{Gemma 2 9B} can match or exceed larger model variants on \textsc{Banking77} or \textsc{Headlines}.
32
+
33
+ \subsection{Further Error Analysis and Discussion}
34
+
35
+ In addition to the aggregate results, we highlight some error patterns:
36
+
37
+ \para{Numeric Reasoning Gaps.}
38
+ Despite partial success in \textsc{FinQA} or \textsc{ConvFinQA}, many LM outputs fail to produce consistent numeric or textual formats (e.g., rounding vs.\ decimal, underscore vs. dash) or handle cross-sentence references. This can be especially detrimental in \textsc{FNXL} labeling.
39
+
40
+ \para{Language Drift and Prompt Issues.}
41
+ Some models (e.g., Qwen\,2\,72B) occasionally drift into non-English outputs for summarization. Additionally, longer label sets (e.g., \textsc{Banking77} with 77 classes) can yield off-list label predictions, harming F1. This could be due to models struggling to precisely remember everything in their context window.
42
+
43
+ \para{Causal Data Scarcity.}
44
+ Given the specialized financial domain, training data for causal detection or classification is limited. Our results reinforce that it remains a bottleneck, and external knowledge or additional reasoning modules might be necessary.
45
+
46
+ \para{Inference Efficiency.}
47
+ While not the primary focus of this work, we note that certain tasks like \textsc{ConvFinQA} significantly increase token usage and inference cost, raising practical concerns for real-world deployments at scale.
48
+
49
+ % Overall, our evaluations show that LM performance on financial tasks is highly task- and domain-dependent. Contrary to the common assumption of a near-monotonic relationship between model scale and performance, many of our benchmarks reveal \emph{inconsistent scaling patterns}.
50
+ \subsection{Efficiency Analysis of Model Performance}\label{sec:meta-analysis-results}
51
+ Beyond raw accuracy or F1, a critical factor for FinNLP is \emph{efficiency}. Tasks such as multi-turn financial question answering (\textsc{ConvFinQA}) and advanced causal classification require lengthy in-context prompts, leading to high inference costs. Critical to note is that smaller models sometimes outperform larger ones by offering a superior trade-off between \emph{throughput} and \emph{accuracy}, making them more viable for real-world applications.\\
52
+ For all of our inference runs, DeepSeek R1 cost approximately \$260 USD, compared to approximately \$105 USD for Claude 3.5 Sonnet and o1-mini and approximately \$4 USD for Meta Llama 3.1 8B. This dramatic price difference suggests that users should choose models carefully based on their use case, as slightly lower-performing models might have dramatically cheaper inference costs. For example, models such as Llama 3.1 70B and DeepSeek-V3 cost less than \$25 USD.\\ (See \refapp{efficiency-analysis} for full details and costs.)
53
+
54
+ % Unless otherwise indicated, we show aggregated metrics (accuracy, F1, BERTScore, etc.) in the main text;
55
+ % Full per-model results appear in \refapp{results}.
56
+
57
+ % \subsubsection{Text Classification Tasks}
58
+ % \label{sec:classification-tasks}
59
+
60
+ % \noindent\textbf{Banking77.}
61
+ % Most models achieve moderate success on \textsc{Banking77} (intent classification), with an average accuracy of around 57.9\%. LMs generally cope well with short textual inputs and a finite set of classes, although they occasionally produce labels outside the defined set. Prompt design can mitigate these formatting issues by explicitly enumerating allowable outputs.
62
+
63
+ % \noindent\textbf{Causal Classification.}
64
+ % Financial causal detection proves notably difficult, with F1 scores ranging from 14\% to 37\%. The higher-scoring Llama\,3\,8B surpasses much larger models, reinforcing the idea that architectural nuances or domain adaptation might be more impactful than scale. The task demands deeper reasoning about cause--effect relationships in textual financial reports, something that few-shot prompting alone may not fully capture.
65
+
66
+ % \noindent\textbf{FinBench and FiQA (Classification Component).}
67
+ % On certain \textsc{FinBench} classification subsets and \textsc{FiQA} sentiment classification, the results vary widely across models. Some smaller or mid-sized LMs outperform larger ones due to more suitable prompts or better in-domain pre-training. In general, classification tasks remain more tractable than generative or multi-step reasoning tasks, yet the wide variance in label formatting and domain-specific jargon complicates zero-shot or few-shot settings.
68
+
69
+ % \subsubsection{Entity Recognition and Extraction}
70
+
71
+ % \noindent\textbf{FiNER.}
72
+ % Entity recognition in financial text remains a challenge, with F1 scores clustering around 12--16\%. Despite the relative simplicity of NER tasks in open-domain contexts, \textsc{FiNER} includes specialized entity types (e.g., financial instruments, macroeconomic terms) that are less common in general-purpose corpora. This leads to reduced model confidence and higher rates of hallucinated or incomplete spans.
73
+
74
+ % \noindent\textbf{FinEntity.}
75
+ % Performance on \textsc{FinEntity} exhibits high variability (0--48\% F1). Notably, \textsc{Qwen\,2\,72B} achieves up to 48\% F1, suggesting that domain-rich pre-training and well-designed prompts can significantly boost entity extraction in finance. Still, the gap between the top models and lower-tier models is large, indicating a need for more robust financial NER.
76
+
77
+ % \subsubsection{Summarization Tasks}
78
+
79
+ % \noindent\textbf{EDTSum and ECTSum.}
80
+ % Summarization tasks see substantially higher scores (\(\approx\)80--82\% BERTScore F1), suggesting that LMs have a relatively easier time identifying key statements or phrases in financial documents than conducting numeric or causal analysis. Extractive summarization especially benefits from large-scale pre-training that emphasizes capturing salient textual segments.
81
+
82
+ % Despite these promising results, some models revert to non-English outputs (e.g., \textsc{Qwen\,2\,72B} occasionally drifting into Chinese), highlighting potential training biases. Furthermore, issues of data contamination remain: if a summarization dataset appears in a model’s pre-training corpus, its zero-shot performance might be inflated.
83
+
84
+ % \subsubsection{Question Answering and Numerical Reasoning}
85
+
86
+ % \noindent\textbf{FinQA \& TATQA.}
87
+ % These hybrid datasets require multi-step arithmetic and table-based reasoning. Accuracy ranges from 26--45\%, illustrating moderate capability in numerical reasoning but also clear difficulties in decimal formatting, rounding, and multi-hop retrieval. Minor differences in how ground-truth answers are represented (e.g., \texttt{14\%} vs.\ \texttt{14.0}) can penalize model outputs unless carefully normalized.
88
+
89
+ % \noindent\textbf{ConvFinQA.}
90
+ % Multi-turn conversation combined with numerical reasoning yields some of the lowest performance (0.13--4.71\% exact match), emphasizing the gap between standard language modeling and the specialized reasoning flows required in real-world financial QA. Despite large parameter counts, most LMs falter in chaining multiple pieces of evidence across conversational turns.
91
+
92
+ % \subsection{Summary of Performance Trends}\label{sec:results-summary}
93
+
94
+ % In aggregate, \papertitle{} highlights the nuanced landscape of financial NLP. While classification and summarization tasks see moderate-to-high baseline scores, tasks requiring deeper understanding of domain-specific concepts (e.g., entity recognition, numeric QA, causal inference) remain unsolved challenges for current generation LMs. Surprisingly, \emph{model size alone} is not a reliable predictor of success, reinforcing the notion that a combination of targeted domain pre-training, explicit numeric/cellular reasoning, and prompt engineering is essential.
FLaME/content/6_conclusion.tex ADDED
@@ -0,0 +1,21 @@
1
+ \section{Conclusion}\label{sec:conclusions}
2
+ We present \papertitle, a robust evaluation framework as well as an open-source software package to easily conduct \textbf{\textit{holistic} evaluation of language models for finance}. \papertitle provides standardized multi-metric evaluation for finance-specific datasets and evaluation methods. This provides a valuable foundation for building, testing, and advancing high-performance NLP models tailored to the unique challenges of financial language understanding. We believe that researchers will adopt a collaborative evaluation framework like \papertitle to easily conduct holistic evaluations of any generally available LM for core FinNLP tasks.\\
3
+
4
+ Our evaluation underscores the complex landscape of FinNLP. Our key insights are as follows:
5
+ \begin{enumerate}
6
+ \item No single LM outperforms all others across every task, but a few models \emdash namely Deepseek R1, OpenAI o1-mini, and Anthropic Claude 3.5 Sonnet \emdash demonstrate strong overall performance. Despite their capabilities, these large models come with significant cost trade-offs compared to smaller, more affordable alternatives.
7
+ \item Model performance varies significantly based on the domain and task structure, with notable differences observed between tasks such as summarization and multi-turn question answering.
8
+ \item Open-weight and mid-scale models such as DeepSeek-V3 and Llama 3.1 70B demonstrate a strong balance between cost-efficiency and performance, underscoring the need for further research to optimize their effectiveness in FinNLP.
9
+ \item There is a notable dearth of datasets across most languages and tasks within the taxonomy. The predominant languages in FinNLP remain English and Chinese.
10
+ \item The taxonomy is a collaborative and evolving framework that requires continuous expansion with additional tasks to adapt to the field's advancements.
11
+ \end{enumerate}
12
+
13
+ Key directions for future research include more advanced prompt engineering, domain-adaptive training (particularly for numeric/causal tasks), and benchmarking efficiency trade-offs. We hope these results guide both industry practitioners and NLP researchers in developing robust financial systems.
14
+
15
+ % \para{Real-World Deployment.} Applications in finance demand reliability not only in textual understanding but also in precise numeric operations under time constraints. Our results show that some smaller or specialized models can produce competitive performance at significantly lower computational cost, a vital factor for cost-sensitive or real-time environments.
16
+ % \para{Future Considerations.}
17
+ % Moving forward, we see three main lines of work: \emph{(i)} more rigorous evaluations that align with real-world tasks and data distributions, \emph{(ii)} exploration of domain-centric expansions or adapters (e.g., low-rank adapters for financial text), \emph{(iii)} robust prompt design that manages label sets, numeric precision, and multi-hop reasoning.
18
+ % \paragraph{Future Directions.}
19
+ % To address these shortcomings, we propose (1) fine-grained pre-training strategies that embed domain knowledge, (2) specialized prompting techniques for numeric reasoning, (3) more rigorous prompt engineering to reduce confusion in multi-class or multi-step outputs, and (4) improved data curation to mitigate contamination and mislabeled training samples.
20
+ % \subsection{Task-Specific Results and Insights}
21
+ % We now break down model performance across the main families of tasks in \papertitle.
FLaME/content/7_limitations.tex ADDED
@@ -0,0 +1,11 @@
1
+ \section{Limitations}\label{sec:limitations}
2
+ % \paragraph{limitations}: there is a larger problem associated with data leakage into the pretraining corpus. this happens when benchmark's testing data or labels are included in the corpus either accidentially during scraping, or worse, done so with the intent of improving performance on that benchmark. For that reason, we are actively working on novel datasets for tasks, allowing \papertitle and its community to keep a subset of the labeled data private and unpublished on the public internet. \papertitle will prefer tasks which feature novel data, or annotations for existing public data in order to minimize risks associated with model contamination. \papertitle will strongly prefer datasets and benchmarks which include a salted hash which help test for data contamination int he pre-training or fine-tuning corpus.
3
+ \papertitle has several notable limitations that should be acknowledged; together, they could significantly impact the robustness and reliability of \papertitle. We discuss these limitations in extensive detail to show the community where we believe the most effort is needed for additional research. The recognition of incompleteness is a major requirement for holistic LM evaluation.
4
+ The limited size and diversity of datasets significantly affect our ability to measure the robustness and generalization of model performance across different scenario contexts. We highlight these areas of incompleteness with our taxonomy.
5
+ Budgets associated with computational cost were another major limiting factor for our study. In order to gather so many results from high-cost proprietary models, we conducted only zero-shot evaluations. We acknowledge this limitation, as techniques such as chain-of-thought and program-of-thought prompting can significantly increase inference costs.
6
+ Adaptations (\ie, model prompting techniques) are not covered within this paper, as in-context learning, structured analytical techniques, and evoking chains of `reasoning' all deserve their own individual study. The benefit of these techniques has been noted and is worthy of further research. The goal of our study is to focus on the zero-shot, un-adapted, and un-augmented performance of the selected foundation LMs.
7
+ We believe that existing research has demonstrated the benefits of these techniques enough to warrant widespread adoption and therefore allocated the computational budget towards exploring more models rather than prompt engineering.
8
+ Finally, the tasks associated with the first version of \papertitle all primarily rely on the English language, as English is the primary language of not only the authors but also of many FinNLP benchmarks. The focus on English for this \textbf{\textit{first iteration}} of \papertitle limits our ability to draw conclusions on the multi-lingual performance of these models. However, the authors have already begun work to expand our benchmark to include multi-lingual coverage. Further, we address this limitation by establishing a living and community-governed benchmark for researchers to collaboratively build. We seek to work alongside other researchers to continually push for updates with new tasks and models. To assist others, we define clear and narrow requirements for inclusion, along with a standardized Python implementation recipe to ensure fair evaluation, in \refapp{datasets} and \refapp{ethicslegal}.
9
+ Despite our efforts to include a wide range of tasks, these datasets do not even begin to capture the breadth and complexity of human cognition required for real-world financial scenarios. The current tasks overlook many highly specialized use cases, local or regional knowledge, or emerging financial products or events. Finally, although \papertitle is easily extensible, the nature of change in financial academia and practice means that benchmarks can lose their effectiveness. Modern financial economics undergoes rapid evolution and change. Due to this dynamic nature, it is very difficult for any benchmark to capture the variability of out-of-sample data. By adopting a collaborative and extensible framework for our benchmark suite, we attempt to mitigate the risks associated with benchmarks becoming trivial to solve or irrelevant to current practice.\\
10
+ % Data Contamination \cite{Sainz2023-vf}
11
+ For a full in-depth discussion recognizing the incompleteness and the limits of this work, please refer to \refapp{incomplete}.
FLaME/content/8_ethics.tex ADDED
@@ -0,0 +1,3 @@
1
+ \section*{Ethics Statement}\label{sec:ethics}
2
+ All datasets and resources in this benchmark are used and shared per their respective licenses. We have audited the license of each included dataset and provided this information in our documentation. The ACL responsible research checklist recommends providing license or terms of use for any dataset or software artifact \cite{ACL2022-qf}. We follow this by explicitly stating each dataset's license (e.g., CC-BY, MIT, etc.) in our Appendix and documentation. We will update the final manuscript for publication to include all details on the leaderboard, an exploration of the user experience, and visualizations for metrics.
3
+ Finally, the authors of \papertitle disclaim and do not accept any liability for financial damages or losses associated with the use of the materials contained within this manuscript. This document and its related materials are only for academic and educational purposes. No commentary provided by the authors or this manuscript should be used as financial, investing, or legal advice. Readers of our findings should consult professionals before any use of these materials. Any use of our academic research constitutes indemnification of the authors against any claims from its use. Please see Appendix \ref{sec:appendixH} for further discussion on our research's ethics and legal aspects, along with our proposed collaboration's governance policies. During writing, the authors used AI tools, including ChatGPT, Gemini, and Claude, for writing assistance, editing, and LaTeX code generation. All usage was in accordance with ACL guidelines and limited to non-substantive tasks, such as formatting, grammar suggestions, and refining phrasing. No AI-generated text was included as original scientific contributions in this work.
FLaME/content/9_acknowledgements.tex ADDED
@@ -0,0 +1,8 @@
1
+ % \section*{Acknowledgments} \label{sec:acknowledgements}
2
+ % First, we would like to thank our anonymous reviewers for their comments and feedback.
3
+ % agam shah for his work on FLUE
4
+ % teghpreet for help during his special problems lmao
5
+ % Mark Riedl for council and advice on writing papers
6
+ % This work is supported in part by the \textsc{Funding source togetherAI}.
7
+ % G Do I have to acknowledge DARPA as my funding grant source on my papers?
8
+ % I want to thank god for all the losers and haters -- I appreciate you trying to keep me humble.
FLaME/content/appendices/appendix_datasets.tex ADDED
@@ -0,0 +1,64 @@
1
+ \section{Framework}\label{app:framework}
2
+ \para{Python Package.} We provide \papertitle as an open-source Python package under a Creative Commons Non-Commercial 4.0 License, offering the research community a \textit{generalizable} framework for reliable and reproducible evaluation of LMs on core NLP tasks for finance. \papertitle standardizes all steps of the evaluation process \dash downloading datasets, setting prompt templates, and computing metrics \dash such that researchers can fairly compare LMs on core NLP tasks across any selected scenario. Our software addresses prior issues of uncoordinated benchmarking by (1) making \emph{all} code, data, and results publicly available, (2) enforcing uniform data-loading pipelines, and (3) logging all inference parameters (e.g., temperature, context window) for transparency. We believe \papertitle will encourage more comprehensive study of new tasks, deeper error analysis, and rapid benchmarking of new models after release. We build our evaluation framework using LiteLLM, which acts as our “universal gateway” to bridge across any local inference engines or cloud API endpoint. This ensures identical prompting and evaluation logic for all models, regardless of whether the model is closed-source or open-weight.\\
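+ As a rough illustration of this unified interface (not the exact code used in \papertitle; the model identifier and prompt below are placeholders), a single LiteLLM completion call might look as follows:
+ \begin{verbatim}
+ # Minimal sketch: one prompt routed through LiteLLM's unified
+ # completion API; model name and prompt are illustrative only.
+ from litellm import completion
+
+ response = completion(
+     model="together_ai/meta-llama/Llama-3-70b-chat-hf",
+     messages=[{"role": "user",
+                "content": "Classify the sentiment of: ..."}],
+     temperature=0,     # deterministic decoding
+     max_tokens=128,
+ )
+ print(response.choices[0].message.content)
+ \end{verbatim}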
3
+ \paragraph{Transparency and Reproducibility.}
4
+ Throughout, \papertitle stores complete metadata for every submission including model version, parameter count, datetime stamps, dataset versioning tags, evaluation settings, prompt templates, decoding parameters, and more. All final results (raw completions, logs, metrics) are compiled and serialized for secondary analysis and auditing. We aim to make \papertitle a trustworthy and collaborative anchor for ongoing financial LM research and take all steps needed to ensure the authenticity of all data used.
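+ As an illustration of the kind of metadata serialized alongside each run (the field names below are assumptions for the sketch, not \papertitle's exact schema):
+ \begin{verbatim}
+ # Sketch: serializing run metadata next to the raw completions.
+ import json
+ from datetime import datetime, timezone
+
+ run_metadata = {
+     "model": "gpt-4o",                  # illustrative
+     "dataset": "finqa",
+     "dataset_revision": "v1.0",
+     "prompt_template": "zero_shot_qa",
+     "decoding": {"temperature": 0, "max_tokens": 256},
+     "timestamp": datetime.now(timezone.utc).isoformat(),
+ }
+ with open("run_metadata.json", "w") as f:
+     json.dump(run_metadata, f, indent=2)
+ \end{verbatim}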
5
+ \section{Datasets}\label{app:datasets}
6
+ \para{Dataset Repository.} \papertitle also hosts a centralized repository of all benchmark datasets as HuggingFace \texttt{dataset} objects for consistent and immediate use by the community. We make these datasets available to users \textbf{\textit{only with the permission of the original authors}}. \papertitle boosts adoption by both academic and industry users by streamlining the evaluation process: (1) guaranteeing all evaluations use standardized formatting, (2) verifying correct annotation labels and dataset splits, and (3) facilitating future expansions by our community (e.g., new language coverage, updates to annotations, data de-duplication).\\
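+ Loading any hosted dataset then reduces to a single call to the \texttt{datasets} library (the repository name below is a placeholder rather than the final naming scheme):
+ \begin{verbatim}
+ # Sketch: loading a benchmark dataset from the Hugging Face Hub.
+ from datasets import load_dataset
+
+ ds = load_dataset("flame-benchmark/finqa", split="test")
+ print(ds[0])
+ \end{verbatim}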
7
+ \subsection{Selection Criteria}
8
+ \para{Domain}: We require that a \textit{majority} of the dataset’s content be directly relevant to finance (e.g., investor filings, policy statements). Datasets that are only tangentially financial (e.g., general news with minor finance topics) are excluded.\\
9
+ \para{Purpose}: We do not include massive corpora intended purely for model \textit{pre-training} or fine-tuning. Instead, we focus on evaluating zero/few-shot performance of foundation LMs.\\
10
+ \para{Task Substance}: The dataset should exercise real finance knowledge or language capabilities (e.g., extracting risk factors, classifying research reports). Overly trivial tasks or single-label corpora are discouraged.\\
11
+ \para{Difficulty}: The dataset should not be trivial for state-of-the-art LMs, yet solvable by domain experts. This ensures the benchmark is challenging enough to reveal meaningful differences in model performance.\\
12
+ \para{Simplicity}: Where possible, tasks should be feed-forward (one input → one output) and not rely on elaborate prompt engineering. We want to measure foundational LM performance rather than specialized engineering hacks.\\
13
+ \para{License and Attribution}: Any dataset in \papertitle must allow open research use and provide attribution for original data authors.\\
14
+ \para{Fairness and Quality.} We require transparent sourcing (first-party or third-party) and minimal risk of label corruption or poor annotation. We strongly prefer tasks built on \textit{novel} data or curated expansions of existing public data to reduce the risk of model contamination.\\
15
+ \para{Bounded Complexity.} We target tasks suitable for foundational LMs in zero-shot settings rather than massive pre-training sets. Long or multi-document tasks must still fit practical LM context windows. For specialized tasks (e.g., advanced numeric forecasting from documents), we will extend our work in the future.\\
16
+ \subsection{Frontier Scenarios and Future Additions}
17
+ We identify multiple \textbf{\textit{frontier scenarios}}—reasoning-based tasks (mathematical or causal), decision-making (market forecasts), advanced knowledge (fact completion, cross-lingual QA), and more \citepHELM. These go beyond standard NLP tasks and often demand specialized labeling or multi-modal input. Our plan is to collaborate with domain experts and the broader community to gradually incorporate these frontiers into \papertitle.
18
+ \subsection{Data Quality Assurance}
19
+ \para{Data Integrity.} We conducted comprehensive validation to ensure that all datasets used in \papertitle were of acceptable quality for use. Before including a dataset, we conduct manual or semi-automated checks for label mismatch, duplicate entries, and incomplete annotations. If the dataset is well-documented and widely cited as reliable, we fast-track its inclusion.\\
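+ A minimal sketch of the kind of semi-automated check described above (the column names are assumptions for illustration):
+ \begin{verbatim}
+ # Sketch: flag duplicate inputs and missing labels in a dataset.
+ from collections import Counter
+
+ def basic_quality_checks(examples):
+     texts = [ex.get("text", "") for ex in examples]
+     duplicates = [t for t, c in Counter(texts).items() if c > 1]
+     missing = [i for i, ex in enumerate(examples)
+                if ex.get("label") in (None, "")]
+     return {"n_duplicates": len(duplicates),
+             "n_missing_labels": len(missing)}
+ \end{verbatim}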
20
+ \para{Community Collaboration.} We invite researchers to submit new datasets or highlight issues in existing ones. Our open GitHub issue tracker logs reported label noise, mismatches between dataset documentation and raw text, or potential duplication with a model's training set. Our philosophy is that the best finance LM benchmark emerges from open-source communities and iterative improvement.\\
21
+ \para{Contamination Risks} Because finance data may appear in large pre-training corpora, we encourage dataset creators to embed “salted” verifiers (hash tokens). \papertitle aims to mitigate unintentional memorization or partial overlap in training data by carefully tracking dataset versions and urging the community to keep \textit{private} test splits off the open web.\\
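+ A minimal sketch of such a salted verifier, assuming maintainers keep the salt private and publish only the resulting hashes:
+ \begin{verbatim}
+ # Sketch: publish salted hashes of held-out test examples instead
+ # of the raw text, so contamination can be tested for later.
+ import hashlib
+
+ SALT = "replace-with-a-private-salt"   # kept secret by maintainers
+
+ def salted_hash(example_text: str) -> str:
+     payload = (SALT + example_text).encode("utf-8")
+     return hashlib.sha256(payload).hexdigest()
+ \end{verbatim}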
22
+ \para{Datasets Excluded.}
23
+ We identified concerns regarding certain datasets during our survey and exclude datasets that have been flagged as concerning by others. Label quality is a major factor in our dataset selection: we choose datasets whose quality has not been reported by the community as problematic. For example, datasets such as the CRA NER dataset \cite{Alvarado2015-oe} have been noted by others \cite{Wu2023-ph,Wu2024-df,Lu2025-iq} as having label quality issues due to a limited selection of only four entity types, which leads to a severely skewed distribution of entity types given the limited data.\\
24
+ The appropriate use of datasets is also important. We exclude datasets that focus on evaluating tabular time-series data with a standard language model. There are reasons to be interested in transformers and decoders as symbolic reasoners over numerical time-series data, but language models are not trained for time-series forecasting. As others have noted \cite{Wu2024-df}, this type of data and task tends to be ineffective for understanding a language model's ability to generate forecasts.\\
25
+ In addition we also exclude datasets that are (i) purely tabular/time-series data that lacks semantic meaning or human-readable text, (ii) proprietary or undisclosed corpora that are not shared publicly or verified, (iii) modified subsets of widely used corpora, if they do not offer new annotations or insights.
26
+ \subsection{Datasets}
27
+ \paragraph{Question Answering.}
28
+ \begin{itemize}
29
+ \item \input{content/datasets/finqa}
30
+ \item \input{content/datasets/convfinqa}
31
+ \item \input{content/datasets/tatqa}
32
+ \end{itemize}
33
+ \paragraph{Text Summarization.}
34
+ \begin{itemize}
35
+ \item \input{content/datasets/ectsum}
36
+ \item \input{content/datasets/edtsum}
37
+ \end{itemize}
38
+ \paragraph{Information Retrieval.}
39
+ \begin{itemize}
40
+ \item \input{content/datasets/finer}
41
+ \item \input{content/datasets/finentity}
42
+ \item \input{content/datasets/fnxl}
43
+ \item \input{content/datasets/finred}
44
+ \item \input{content/datasets/refind}
45
+ \end{itemize}
46
+ \paragraph{Sentiment Analysis.}
47
+ \begin{itemize}
48
+ \item \input{content/datasets/fiqa}
49
+ \item \input{content/datasets/fpb}
50
+ \item \input{content/datasets/subjectiveqa}
51
+ \item FiNER falls under Information Retrieval and Sentiment Analysis, see Information Retrieval section for the dataset information.
52
+ \end{itemize}
53
+ \paragraph{Text Classification.}
54
+ \begin{itemize}
55
+ \item \input{content/datasets/banking77}
56
+ \item \input{content/datasets/finbench}
57
+ \item \input{content/datasets/numclaim}
58
+ \item \input{content/datasets/headlines}
59
+ \item \input{content/datasets/fomc}
60
+ \end{itemize}
61
+ \paragraph{Causal Analysis.}
62
+ \begin{itemize}
63
+ \item \input{content/datasets/fincausal}
64
+ \end{itemize}
FLaME/content/appendices/appendix_ethicslegal.tex ADDED
@@ -0,0 +1,31 @@
1
+ \section{Ethics \& Legal}\label{app:ethicslegal}
2
+ \subsection{Dataset Attribution and Licensing}
3
+ All datasets included in our benchmark suite are appropriately credited to their original sources and used in compliance with their licenses. We emphasize proper citation for each dataset and strictly adhere to any usage restrictions stated by the dataset creators. Audits of AI benchmarks have found that lack of proper attribution is a \textbf{major} issue, with datasets missing even basic license information and frequent (often self-serving) misattribution \cite{Longpre2023-ba, Longpre2024-it}.
4
+ \paragraph{Attribution and Citation:} Each dataset is accompanied by a citation to its original publication or official repository. In the benchmark documentation and this paper, we provide full references for every dataset, ensuring the original authors receive credit. When using or describing a dataset, we explicitly acknowledge its creators. This practice maintains academic integrity and helps others find the source of the data.
5
+ \paragraph{License Compliance:} For every dataset, we review the license to ensure our use conforms to its terms. Datasets released under permissive open-source licenses (e.g., MIT, CC BY) are incorporated with proper attribution and without modification to licensing. For datasets under more restrictive or non-commercial licenses (e.g., CC BY-NC), we restrict usage to research or other non-commercial purposes \cite{Creative-Commons2020-qd}. We clearly label each dataset with its license type in our documentation, and we include any required license text or attribution notices. Users of the benchmark are reminded to heed these licenses, meaning they should not engage in prohibited uses (such as commercial applications for CC BY-NC data) and must fulfill any requirements (such as attribution in publications).
6
+ \paragraph{Re-hosting with Permission:} We only re-host datasets when it is legal and ethical to do so. If a dataset’s license allows redistribution (or the dataset is public domain), we may mirror it on our platform (e.g., on the Hugging Face Hub or a project website) for convenient access. In such cases, we preserve the original content and license file, and include documentation about its provenance. If redistribution is not permitted by the license, we do \emph{not} host the raw data ourselves. Instead, we provide links, download scripts, or documentation for users to obtain the data directly from the original source, ensuring we respect the dataset owners’ rights. In some instances, we have obtained explicit permission from dataset creators to include their data in our benchmark package. All re-hosted data is provided in accordance with the original license terms and with clear attribution to the source.
7
+ \subsection{Collaboration Guidelines}
8
+ Our benchmark is a community-oriented project, and we welcome collaboration from external researchers who wish to contribute. To manage contributions effectively while maintaining high quality, we have established guidelines for those looking to add new datasets or improve existing ones. Below we outline how researchers can get involved, the criteria for accepting new datasets, and the process by which contributions are reviewed:
9
+ \paragraph{Contributing New Datasets:} External researchers can contribute datasets by following our open contribution process (detailed in the project repository). In practice, this means interested contributors should prepare their dataset in a standard format (including training/validation/test splits as appropriate and a clear description). They can then submit the dataset through a pull request on our GitHub repository or via an official submission form. Each submission should include essential documentation (e.g., a README or datasheet describing the dataset’s content, source, size, and license) and, if possible, a citation to a paper or source associated with the dataset. We also encourage contributors to upload the dataset to the Hugging Face Hub (or a similar platform) for easy integration, using a consistent naming scheme and providing a data card.
10
+ \paragraph{Acceptance Criteria:} To ensure quality and relevance, we evaluate each proposed dataset against several criteria before acceptance. First, the dataset must be clearly related to financial NLP (e.g., financial news analysis, risk report parsing, market question answering, etc.), adding coverage of a task that is valuable to the community. The data should be of high quality: for instance, annotations (labels, answers, etc.) should be correct and reliable, and the dataset should be of adequate size to support meaningful model evaluation. Datasets also need to have clear documentation of how they were collected and what they contain. Another crucial criterion is licensing and ethics: the dataset must have an appropriate license that at least allows research use (we cannot accept data with unknown or overly restrictive licenses), and it should not violate privacy or ethical norms (for example, we avoid proprietary data that was obtained without permission or data containing sensitive personal information). If a dataset fails to meet any of these criteria, we provide feedback to the contributor with suggestions for remediation (such as obtaining proper licensing or improving documentation).
11
+ \paragraph{Submission Review Process:} All dataset contributions undergo a review process overseen by the benchmark maintainers (and, if applicable, an advisory board of domain experts). When a contribution is submitted, the maintainers will verify the dataset’s format and integrity (ensuring it can be loaded and used in our evaluation pipeline), run basic quality checks, and assess the documentation and license. We also review a sample of the data to catch any obvious issues (like sensitive data that should be anonymized or mislabeled examples). If the dataset passes these checks, the maintainers discuss its fit for the benchmark. This often involves confirming that the dataset does not duplicate an existing resource and that it offers unique value. During review, the contributors might be contacted for clarifications or requested to make minor changes (for instance, to fix formatting or to add missing references). Once a dataset is approved, it is merged into the benchmark suite: we add it to our repository, include information about it in the official documentation (with credit to the contributors), and incorporate it into our benchmarking pipeline (so that models can be evaluated on it). Contributors of accepted datasets are acknowledged in the project to recognize their efforts.
12
+ \paragraph{Maintaining Quality and Updates:} Even after a dataset is accepted, we have guidelines to maintain the overall quality of the benchmark. We encourage continuous feedback from the community. If users of the benchmark identify issues with a dataset (such as label errors, formatting bugs, or ethical concerns that were overlooked), they can report these to the maintainers (for example, by opening an issue on GitHub). The maintainers will investigate and, if necessary, update or patch the dataset (in coordination with the original contributor when possible). We also periodically review the suite of datasets to see if any should be updated (for example, newer versions released by the original authors) or deprecated (if a better dataset for the same task becomes available or if usage of a dataset raises unforeseen problems). Through this collaborative and iterative process, we ensure the benchmark remains a living resource that stays relevant and trustworthy.
13
+ \subsection{Hosting Policies}
14
+ To maximize accessibility and ensure longevity, we host the benchmark’s datasets and results on reliable, open platforms. Our hosting strategy involves multiple channels: an online hub for datasets, a source code repository for the benchmark framework and results, and archival publications for permanence. Here we detail where the data and results are hosted and how users can access and cite them:
15
+ \paragraph{Dataset Repository and Access:} We provide public access to the datasets through the Hugging Face Datasets Hub and our project’s GitHub. Each dataset included in the benchmark (that is permitted to be shared) is uploaded as a dataset package on Hugging Face under an organizational account for the benchmark. This allows users to easily load the data using the \texttt{datasets} library (for example, via \texttt{load\_dataset("holiflame/dataset\_name")}). On each dataset’s Hugging Face page, we include a detailed description (dataset card) that notes the dataset’s source, contents, license, and citation instructions. For completeness, we also maintain a GitHub repository where we list all datasets and provide direct links or scripts. This is especially useful for datasets that cannot be hosted directly; for those, the repository contains a script (or instructions) to download the data from the original source. In all cases, accessing the data is free for research purposes, and no login or special permission is required beyond agreeing to the terms of the original licenses.
16
+ \paragraph{Benchmark Code and Results Hosting:} The code for running benchmark evaluations (including model evaluation scripts, metrics, and any wrappers around the datasets) is hosted on GitHub in the same repository that handles contributions. This repository serves as the central hub for development and version control. It includes documentation on how to run evaluations and reproduce the results from our paper. In addition to code, we host the benchmark results and leaderboards. For example, the repository (or an associated project webpage) contains tables of model performances on each dataset, updated as new models are evaluated. We plan to update these results over time and possibly integrate with the Papers with Code platform for an interactive leaderboard. To ensure results are archived for reference, we also include the main results in this paper’s Appendix and will release periodic reports (with DOIs) if the benchmark is extended significantly. Our initial benchmark results are part of this ACL paper (and thus stored on the ACL Anthology as a permanent record), and any future updates may be published in workshop proceedings or on arXiv to provide a citable reference.
17
+ \paragraph{Transparency and Peer Review:} All submissions are checked by automated scripts that verify legitimacy, parse outputs, and compute metrics. This approach fosters peer review, since all users can replicate results from previous submissions or highlight anomalies in existing model evaluations. Users bring continuous updates as new models emerge \dash researchers can quickly add them to a living benchmark for financial NLP. We envision a community-run ecosystem where model owners, domain experts, and external contributors jointly expand \papertitle{}'s tasks, metrics, and data coverage.\\
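+ For instance, a reviewer could recompute a reported metric directly from the serialized completions (the file layout below is an assumption for the sketch):
+ \begin{verbatim}
+ # Sketch: re-deriving accuracy from a serialized results file so
+ # that any user can audit a submission.
+ import json
+
+ with open("completions.jsonl") as f:
+     records = [json.loads(line) for line in f]
+
+ correct = sum(r["parsed_prediction"] == r["label"] for r in records)
+ print(f"accuracy = {correct / len(records):.3f}")
+ \end{verbatim}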
18
+ \para{Accessing and Citing Data:} We provide clear guidelines for how to access and use the benchmark data. Each dataset’s entry in our documentation explains the preferred access method (e.g., via Hugging Face or via our scripts). We also outline how to cite the data. Proper citation is twofold: users should cite this benchmark suite (to acknowledge the collection and any benchmark-specific curation) and also cite the original source of the dataset. In our documentation and in each dataset card on Hugging Face, we list the relevant citation (often the academic paper that introduced the dataset). Users of the benchmark are expected to include those citations in any publication or report that uses the benchmark. Additionally, when using or sharing the data, users must abide by the license terms attached to each dataset. This means, for instance, if a dataset is CC BY-NC, anyone reusing it should not use it commercially and should include the proper attribution in any derivative works. We make this information readily available to prevent any unintentional misuse. In summary, the data and results are openly accessible on popular platforms, and we provide extensive guidance on how to retrieve, cite, and leverage the benchmark materials in a responsible manner.\\
19
+ \subsection{Ethical Considerations}
20
+ Ethical compliance is a cornerstone of our benchmark design. In curating and releasing financial NLP datasets, we take care to respect privacy, obtain necessary consents, and promote fairness. We align our practices with the ACL ethics guidelines and broader community standards for handling data. Below, we discuss the ethical measures in place regarding data privacy, consent, bias, and overall responsible use of data:
21
+ \para{Data Privacy and Consent:} Many financial datasets involve text from reports, news, or social media, which generally pertain to companies or markets rather than private individuals. However, in cases where data might include personal or sensitive information (for example, customer reviews, financial advice communications, or user profiles in fraud detection data), we ensure that privacy is safeguarded. We only include such data if it has been made public with consent or properly anonymized. Specifically, if a dataset contains any personally identifiable information (PII), we verify that the data was collected with informed consent and that the individuals understood their data would be used for research. If this cannot be verified, the dataset is excluded or the PII is removed. Additionally, we avoid datasets that contain sensitive financial records of private individuals unless they are fully anonymized or synthetic. By taking these precautions, we uphold individuals’ privacy rights and comply with regulations and ethical norms around data protection.\\
22
+ \para{Bias and Fairness:} We recognize that datasets can inadvertently reflect biases (for example, a credit scoring dataset might over-represent certain demographics, or a financial news dataset might be predominantly from one country’s media). To address this, we encourage dataset contributors to document any known biases or limitations in their data. During the review process, we assess whether the dataset’s content could lead to biased models (such as bias against a group or region) and consider the diversity of the dataset. Our benchmark aims to cover a broad range of financial scenarios (including different markets, languages, and subdomains like banking, investment, insurance) to provide a balanced evaluation. When biases are unavoidable (as they often are in real-world data), we make them transparent: the documentation for each dataset notes aspects like the time period it covers, the geography or entities it focuses on, and any known skew. Users of the benchmark should be aware of these context details when interpreting results. Furthermore, we are committed to updating the benchmark with more diverse datasets over time, to improve fairness and representativeness across the financial NLP tasks.
23
+ \paragraph{Transparency and Data Documentation:} In line with principles of research transparency and reproducibility, we provide detailed documentation for every dataset in the benchmark. This includes a description of how the data was collected, what the data consists of (e.g., “10,000 financial news articles from 2010-2020, annotated with sentiment labels by experts”), and any preprocessing steps we performed (such as removing certain fields or normalizing text). We also clearly state the intended use of the dataset and any limitations. Each dataset entry is akin to a datasheet or card that enumerates its characteristics, ensuring that anyone using the dataset understands its context. If a dataset comes with specific usage restrictions or ethical considerations beyond the license (for example, a clause that one should not attempt to re-identify individuals mentioned in the data), we prominently communicate those conditions to the users. By providing this level of transparency, we help researchers use the data responsibly and enable them to explain their results with knowledge of the data’s nuances.
24
+ \paragraph{Compliance with Ethical Standards:} Our project abides by the ACL Code of Ethics and broader CS research ethical guidelines. This means that in assembling the benchmark, we have avoided any actions such as using data without permission, violating terms of service of websites, or including content that is derogatory or harmful without due reason. All team members and contributors are expected to follow ethical practices. For instance, if someone were to suggest adding a dataset obtained through web scraping a financial platform, we would require proof that this scraping did not violate the platform’s policies and that no confidential information is included. We also strive for transparency in our own work: any potential ethical issues we encountered during dataset collection or integration are disclosed in our documentation. In cases where we had doubts about a dataset’s ethical viability, we consulted with an ethics advisor or chose to err on the side of caution by not including that data. By enforcing these standards internally and for external contributions, we aim to set a positive example and ensure that the benchmark can be used freely without ethical reservations.
25
+ \subsection{Community Expectations}
26
+ Any benchmark suite's success relies on having a responsible community of users, contributors, and maintainers. We outline here what we expect from all parties involved to ensure the resource remains trustworthy, well-maintained, and useful for everyone. These expectations cover how data should be treated, how credit should be given, and how collaboration should occur in practice:
27
+ \paragraph{Responsible Use by Users:} Researchers and practitioners using the benchmark are expected to use the data and results responsibly. This means they should not misuse the datasets (for example, by trying to extract or infer private information about individuals from a dataset that has been anonymized) and should respect any usage guidelines provided. If a dataset is flagged as for non-commercial use only, users must refrain from deploying it in commercial products. Users should also be careful to preserve the integrity of the data: avoid altering datasets except for necessary preprocessing, and certainly do not modify labels or data points in a way that could mislead results. If a user discovers an issue in a dataset (such as a systematic labeling error or a broken link), we expect them to inform the maintainers via the appropriate channel (GitHub issue, email, etc.) so that it can be addressed for the benefit of all.
28
+ \paragraph{Proper Citation and Acknowledgment:} We expect all users of the benchmark to give proper credit in their publications or projects. At minimum, this involves citing this benchmark (the ACL paper or associated technical report) as the source of the evaluation suite, as well as citing the original sources of any datasets used. Proper citation not only acknowledges the work of the benchmark organizers and dataset creators, but also allows others to trace back to the original data for verification or further research. In our benchmark documentation, we provide a BibTeX entry for the benchmark itself and recommend citation strings or references for each dataset. When writing a paper that uses, say, the FiQA sentiment analysis dataset from our suite, the author should cite the FiQA paper in addition to our benchmark paper. This practice is in line with community norms and some dataset licenses that mandate attribution. Users should also acknowledge any tools or baseline results from the benchmark if they directly use them.
29
+ \paragraph{Contributor and Maintainer Responsibilities:} Contributors who add datasets or code are expected to maintain a high standard of quality and ethics. They should only contribute data that they have the right to share and that meets the criteria outlined above. Contributors are also encouraged to remain engaged after their dataset is added, in case updates or fixes are needed. On the other side, maintainers (the core team overseeing the benchmark) have the responsibility to manage contributions fairly and efficiently. They should provide constructive feedback to contributors, merge accepted contributions in a timely manner, and update documentation accordingly. Maintainers are also responsible for monitoring the health of the project – if a dataset becomes unavailable or if a license changes, the maintainers must act (e.g., by finding an alternative hosting solution or removing the dataset if it no longer can be shared). Both contributors and maintainers should adhere to a code of conduct that emphasizes respectful communication, openness to feedback, and collaborative problem-solving. Any disputes (for example, if a contribution is deemed unsuitable) should be handled transparently and with courtesy.
30
+ \paragraph{Community Collaboration:} We foster an open community environment. Users are encouraged to share their experiences with the benchmark, such as posting results, writing tutorials, or comparing models, in forums or social media, as long as they credit the source. We have set up a discussion board (or use an existing platform like the Hugging Face forums or a Discord channel) for the benchmark where people can ask questions, suggest improvements, or seek help. The expectation is that community members will help each other, making the benchmarking process easier and more standardized. For example, if someone has trouble using a particular dataset, others who have used it can chime in with advice. This kind of peer support is invaluable. We ask that all community interactions remain professional and focused on the science – harassment, discrimination, or any form of unprofessional behavior is not tolerated. By cultivating a friendly and inclusive atmosphere, we hope to attract a wide range of contributors and users, which in turn makes the benchmark more robust and widely applicable.
31
+ \paragraph{Extending and Evolving the Benchmark:} The benchmark is not a static resource; we expect it to evolve as the field progresses. Community members who identify gaps in the benchmark (for instance, a new type of financial NLP task that is not covered) are encouraged to propose extensions. This could include new datasets, new evaluation metrics, or even new challenge tasks. When doing so, we expect the same level of rigor as for the initial benchmark: thorough documentation, ethical data handling, and openness to peer review. If researchers create their own extension of the benchmark for private use (say, adding proprietary data for an internal evaluation), we of course cannot enforce the same rules, but we encourage them to share their insights or tools with the community whenever possible. Should any such extensions be made public, we hope the creators will merge efforts with us so that the community has a unified benchmark rather than many fragmented ones. In summary, every user and contributor has a role in upholding the integrity of the benchmark. By using the data conscientiously, citing sources, contributing improvements, and collaborating respectfully, the community ensures that this benchmark remains a valuable asset for financial NLP research now and in the future.
FLaME/content/appendices/appendix_incomplete.tex ADDED
@@ -0,0 +1,101 @@
1
+ \section{Recognition of Incompleteness}\label{app:incomplete}
2
+ \subsection{What is Missing}
3
+ Given the large number of foundation language models, it became financially infeasible for us to conduct a thorough study of every dataset we have identified and classified in our taxonomy within a single paper. The \papertitle leaderboard is intended as a collaborative community effort, which we plan to update continuously as we gather more data on these foundation models.
4
+ \subsection{What Was Not Considered}
5
+ Aspects of artificial intelligence systems beyond the foundational language model are not within the scope of our study. For instance, systems such as knowledge graphs
6
+ % \cite{Sarmah2024-an, Lo2024-nr, Zhang2023-eg, Pan2022-yv, Lewis2020-tw}
7
+ ,
8
+ retrieval-augmented generation (RAG)
9
+ % \cite{Sarmah2024-an, Lo2024-nr, Zhang2023-eg, Pan2022-yv, Lewis2020-tw}
10
+ ,
11
+ and various hybrid approaches have been shown to be beneficial in finance. However, datasets or benchmarks that focus on RAG are excluded because they assess factors beyond the language model itself (e.g., embedding quality, vector selection, and specialized metrics). Similar considerations apply to knowledge graphs. These aspects of AI systems have been explored in previous research, and we believe they deserve dedicated studies of their own.
12
+ \subsection{Frontier Scenarios}\label{app:frontiertasks}
13
+ Beyond our core set of NLP tasks \refsec{taxonomy}, we recognize a broader class of \textbf{frontier scenarios} that lie outside the scope of \papertitle's current evaluation. Each of these frontiers reflects emerging or highly specialized challenges in finance. We envision these domains as a natural extension for future research, requiring not only specialized datasets but also domain-specific metrics, rigorous protocols, and potentially interdisciplinary expertise. While \papertitle currently focuses on fundamental NLP tasks (e.g., QA, summarization, sentiment analysis), evaluating these frontier tasks deserves more thorough study and further discussion.
14
+ \paragraph{(1) Reasoning.}
15
+ Robust multi-step reasoning is crucial in finance, from mathematical and logical derivations (e.g., portfolio optimization, derivatives pricing) to causal and counterfactual reasoning (e.g., modeling how regulatory changes might affect stock prices). Structured data reasoning and code synthesis also figure prominently in automated financial analysis, such as generating scripts for data cleaning or computing risk metrics. Despite their importance, we omit these tasks in our current benchmark because:
16
+ \begin{enumerate}
17
+ \item They often demand carefully labeled multi-step annotations (e.g., detailed solution outlines for financial math problems).
18
+ \item They rely on domain-specific metrics that go well beyond typical F1 or BLEU scores (e.g., verifying the correctness of an interest-rate calculation, or confirming that code compiles and produces the right financial outputs).
19
+ \item They can require domain experts to judge the validity of reasoning steps, significantly increasing the cost of dataset creation and evaluation.
20
+ \end{enumerate}
21
+ \paragraph{(2) Knowledge.}
22
+ Tasks such as \emph{fact completion}, \emph{knowledge-intensive QA}, and \emph{critical reasoning} are pivotal in scenarios requiring specialized financial intelligence. A language model might need to recall policy clauses or legal precedents relevant to specific industry regulations, or integrate large-scale macroeconomic knowledge to answer multi-domain questions (e.g., “How do rising interest rates influence credit default swaps?”). Constructing comprehensive knowledge-focused evaluations in finance poses challenges such as:
23
+ \begin{enumerate}
24
+ \item \textbf{Coverage:} Maintaining an up-to-date repository of financial facts (e.g., corporate structures, compliance rules) is daunting due to constant changes in markets and regulatory environments.
25
+ \item \textbf{Verification and Fact-Checking:} Complex financial facts often demand external references (e.g., official filings), and verifying correctness is non-trivial.
26
+ \end{enumerate}
27
+ \paragraph{(3) Decision-Making.}
28
+ Finance ultimately revolves around decision-making tasks such as \emph{market forecasting}, \emph{risk management}, \emph{stock-movement prediction}, and \emph{credit scoring}. These activities often combine numerical time-series modeling with textual signals (e.g., news articles, analyst reports) and may include advanced simulation or reinforcement-learning techniques (e.g., algorithmic trading strategies). Because these tasks are \textbf{high-stakes} and multi-modal (texts, tables, time-series), we have excluded them from \papertitle. Properly benchmarking decision-oriented tasks involves:
29
+ \begin{enumerate}
30
+ \item Access to real-time or historical \emph{structured} financial data (e.g., stock price feeds).
31
+ \item Well-defined metrics that can meaningfully assess predictive accuracy or risk-adjusted returns.
32
+ \item Potential integration of ethical and legal constraints (e.g., insider trading regulations).
33
+ \end{enumerate}
34
+ \paragraph{(4) Human Alignment.}
35
+ Large language models can inadvertently propagate harmful behaviors—e.g., misinformation, social biases, or privacy violations. In finance, these concerns become critical due to the potential for \emph{disinformation} (fake financial news), \emph{toxic content} (harassment in investor forums), or \emph{privacy breaches} in sensitive customer data. Addressing alignment means ensuring LLMs are \emph{honest, harmless, and helpful} in financial contexts. It also covers memorization of sensitive data (e.g., replicating personal credit history) and copyrighted materials. Each topic warrants extensive research:
36
+ \begin{enumerate}
37
+ \item \textbf{Social Bias and Toxicity}: Minimizing harmful language and misinformation.
38
+ \item \textbf{Privacy and Copyright}: Preventing models from disclosing proprietary or regulated information.
39
+ \item \textbf{Regulatory Compliance}: Evolving laws may require auditing an LLM’s data usage or output content.
40
+ \end{enumerate}
41
+ \paragraph{(5) Multi-Modal.}
42
+ Many real financial workflows rely on data that is not purely text—e.g., Excel spreadsheets, visual charts, scanned PDF statements, or contract images. Tasks like \emph{table-based QA}, \emph{tool use} (e.g., integrative question answering with Python or R scripts), and \emph{visual analysis} (e.g., reading corporate diagrams or trade forms) are vital for practical applications. However, true multi-modal setups typically require:
43
+ \begin{enumerate}
44
+ \item Specialized architectures or bridging modules that fuse text with tabular or image data.
45
+ \item Domain-adapted evaluation methods (e.g., metrics for chart-based questions).
46
+ \item Substantial cross-disciplinary expertise to annotate or interpret financial images and tables consistently.
47
+ \end{enumerate}
48
+ As such, we limit \papertitle to text-only tasks for its initial release, but we envision future expansions that incorporate multi-modal data sources in an end-to-end benchmarking pipeline.
49
+ \paragraph{Call for Collaboration}
50
+ Despite excluding these frontier domains from our initial evaluation suite, we emphasize that each is critical for a holistic understanding of AI in finance. We invite the community to develop specialized datasets, metrics, and tools that address these open challenges—whether involving advanced reasoning about financial instruments, building robust knowledge graphs of regulatory clauses, or evaluating alignment with compliance frameworks. Over time, we aim to integrate such expansions into \papertitle so that practitioners can measure model capabilities comprehensively on the most relevant, contemporary tasks.
51
+ % \section{Recognition of Incompleteness}\label{app:incomplete}
52
+ % \subsection{What is missing}
53
+ % With the largest number of foundation language models trained on it became financially infeasible for us to conduct a thorough study of ever dataset we have identified and classified within our taxonomy for a single paper. The \papertitle leaderboard is designed to be a collective community effort, which we continue to update continuously as we continue in the future to collect data on these foundational models.\\
54
+ % \subsection{what was not considered}
55
+ % Aspects of artificial intelligence systems beyond the foundational language model are not within the scope of our study. Systems such as Knowledge Graphs , and combinations of the two [*,*,*] have been demonstrated [*,*,*] to show a great deal of use for finance, economics, and business. Datasets or benchmarks related to retrieval augmented generation (RAG) are not included, as these are assessing aspects beyond the foundation language model such as embedding quality, vector selection, and novel metrics for RAG, or [x,y,z] for KGs. These aspects of AI systems have seen many past studies [*,*,*] and we believe that they deserve dedicated research studies.\\
56
+ % \subsection{Frontier Scenarios}\label{sec:appendixG:frontier}
57
+ % Beyond our core set of NLP tasks (\S\ref{sec:taxonomy:task}), we recognize a broader class of \textbf{frontier scenarios} that lie outside the scope of \papertitle's current evaluation. Each of these frontiers reflects emerging or highly specialized challenges in finance. We envision these domains as a natural extension for future research, requiring not only specialized datasets but also domain-specific metrics, rigorous protocols, and potentially interdisciplinary expertise. While \papertitle currently focuses on fundamental NLP tasks (e.g., QA, summarization, sentiment analysis), we believe these evaluating these frontier tasks deserves more thorough study and further discussion.\\
58
+ % \para{(1) Reasoning.}
59
+ % Robust multi-step reasoning is a critical need in finance, ranging from mathematical and logical derivations (e.g., portfolio optimization, derivatives pricing) to causal and counterfactual reasoning (e.g., modeling how regulatory changes might affect stock prices). In addition, structured data reasoning and code synthesis figure prominently in automated financial analysis, such as generating scripts for data cleaning or computing risk metrics.
60
+ % Despite their importance, we omit these tasks in our current benchmark because:
61
+ % \begin{enumerate}
62
+ % \item They often demand carefully labeled multi-step annotations (e.g., detailed solution outlines for financial math problems).
63
+ % \item They rely on domain-specific metrics that go well beyond typical F1 or BLEU scores (e.g., verifying a derived interest rate formula’s correctness, or confirming that code compiles and produces the right financial calculation).
64
+ % \item They can require domain experts to judge the validity of reasoning steps, significantly raising the bar for dataset creation and evaluation.
65
+ % \end{enumerate}
66
+
67
+ % \para{(2) Knowledge.}
68
+ % Tasks such as \emph{fact completion}, \emph{knowledge-intensive QA}, and \emph{critical reasoning} are pivotal in scenarios requiring specialized financial intelligence. A language model might need to recall policy clauses or legal precedents relevant to specific industry regulations, or it might integrate large-scale macroeconomic knowledge to answer multi-domain questions (e.g., “How do rising interest rates influence credit default swaps?”).
69
+ % However, constructing comprehensive knowledge-focused evaluations in finance poses substantial challenges, including:
70
+ % \begin{enumerate}
71
+ % \item \textbf{Coverage:} Maintaining an up-to-date repository of financial facts (corporate structures, compliance rules) can be daunting due to constant changes in the markets and regulatory environments.
72
+ % \item \textbf{Verification and Fact-Checking:} Complex financial facts often demand external references (e.g., official filings), and verifying correctness is non-trivial.
73
+ % \end{enumerate}
74
+
75
+ % \paragraph{(3) Decision-Making.}
76
+ % Finance ultimately revolves around decision-making tasks such as \emph{market forecasting}, \emph{risk management}, \emph{stock movement prediction}, and \emph{credit scoring}. These activities often blend numerical time-series modeling with textual signals (e.g., news articles, analyst reports) and may include advanced simulation or reinforcement learning techniques (e.g., algorithmic trading strategies).
77
+ % Because these tasks are \textbf{high-stakes} and multi-modal (texts, tables, time-series), we have currently excluded them from \papertitle. Properly benchmarking decision-oriented tasks involves:
78
+ % \begin{enumerate}
79
+ % \item Access to real-time or historical \emph{structured} financial data (e.g., stock price feeds).
80
+ % \item Well-defined metrics that can meaningfully assess predictive accuracy or risk-adjusted returns.
81
+ % \item Potential integration of ethical and legal constraints (e.g., insider trading regulations).
82
+ % \end{enumerate}
83
+ % \paragraph{(4) Human Alignment.}
84
+ % Large language models can inadvertently propagate harmful behaviors—e.g., misinformation, social biases, or privacy violations. In finance, these concerns become critical due to the potential for \emph{disinformation} (fake financial news), \emph{toxic content} (harassment in investor forums), or \emph{privacy breaches} in sensitive customer data.
85
+ % Addressing alignment means ensuring LLMs are \emph{honest, harmless, and helpful} in financial contexts. It also covers memorization of sensitive data (e.g., replicating personal credit history) and copyrighted materials. These topics each warrant extensive research:
86
+ % \begin{enumerate}
87
+ % \item \textbf{Social Bias and Toxicity}: Minimizing harmful language that contains misleading or false information.
88
+ % \item \textbf{Privacy and Copyright}: Preventing models from disclosing proprietary or regulated information.
89
+ % \item \textbf{Regulatory Compliance}: Evolving laws may demand auditing an LLM’s data usage or output content.
90
+ % \end{enumerate}
91
+ % \paragraph{(5) Multi-Modal.}
92
+ % Finally, many real financial workflows rely on data that is not purely text—e.g., Excel spreadsheets, visual charts, scanned PDF statements, or contract images. Tasks like \emph{table-based Q\&A}, \emph{tool use} (e.g., integrative question answering with Python or R scripts), and \emph{visual analysis} (e.g., reading corporate diagrams or trade forms) are vital for practical applications.
93
+ % However, true multi-modal setups typically require:
94
+ % \begin{enumerate}
95
+ % \item Specialized architectures or bridging modules that fuse text with tabular or image data.
96
+ % \item Domain-adapted evaluation methods (e.g., how to measure correctness in a chart-based question).
97
+ % \item Substantial cross-disciplinary expertise to annotate or interpret financial images and tables consistently.
98
+ % \end{enumerate}
99
+ % As such, we deliberately limit \papertitle to text-only tasks for its initial release, but we envision future expansions that incorporate multi-modal data sources in an end-to-end benchmarking pipeline.\\
100
+ % \para{Call for Collaboration}
101
+ % Despite excluding these frontier domains from our initial evaluation suite, we emphasize that each is critical for a truly holistic understanding of AI for finance. We invite the community to develop specialized datasets, metrics, and tooling that address these open challenges \dash whether it involves advanced reasoning about financial instruments, building robust knowledge graphs of regulatory clauses, or evaluating alignment with compliance frameworks. Over time, we aim to integrate such expansions into \papertitle so that practitioners can measure model capabilities in high fidelity on the most relevant contemporaneous tasks.
FLaME/content/appendices/appendix_models.tex ADDED
@@ -0,0 +1,61 @@
1
+ \section{Models} \label{app:models}
2
+ In this section, we present Table \ref{tab:model-detail}, which details the various models evaluated on the benchmarks along with the associated evaluation costs.
3
+ % \paragraph{Meta Family.}
4
+ % \begin{itemize}
5
+ % \item \textbf{Llama 3 70B Instruct Reference} \cite{meta-llama3}: A 70-billion parameter instruction-tuned model from Meta's Llama 3 series, optimized for complex instruction-following tasks.
6
+ % \item \textbf{Llama 3 8B Instruct Reference} \cite{meta-llama3}: An 8-billion parameter variant of Meta's Llama 3 series, designed for efficiency in instruction-based tasks.
7
+ % \item \textbf{LLaMA-2 Chat (13B)} \cite{meta-llama2}: A 13-billion parameter conversational model from Meta's LLaMA-2 series, optimized for dialogue.
8
+ % \end{itemize}
9
+ % \paragraph{Databricks Family.}
10
+ % \begin{itemize}
11
+ % \item \textbf{DBRX Instruct} \cite{databricks-dbrx}: Developed by Databricks, this model is fine-tuned for instruction-following, leveraging the Mosaic AI platform.
12
+ % \end{itemize}
13
+ % \paragraph{DeepSeek AI Family.}
14
+ % \begin{itemize}
15
+ % \item \textbf{DeepSeek LLM Chat (67B)} \cite{deepseek-llm}: A 67-billion parameter chat-optimized model from DeepSeek AI.
16
+ % \item \textbf{DeepSeek-V3} \cite{deepseek-v3}: A newer model from DeepSeek AI with improved capabilities, though specific details remain limited.
17
+ % \item \textbf{DeepSeek R1} \cite{deepseek-r1}: Another variant from DeepSeek AI, designed for specialized NLP tasks.
18
+ % \end{itemize}
19
+ % \paragraph{Google Family.}
20
+ % \begin{itemize}
21
+ % \item \textbf{Gemma 2 27B} \cite{google-gemma}: A 27-billion parameter model from Google's Gemma 2 series, optimized for NLP applications.
22
+ % \item \textbf{Gemma 2 9B} \cite{google-gemma}: A smaller, 9-billion parameter variant of the Gemma 2 series.
23
+ % \item \textbf{Google Gemini 1.5 Pro} \cite{Gemini-Team2024-na}: A highly capable model in Google's Gemini series, though specific details about its architecture remain undisclosed.
24
+ % % \item \textbf{Google Gemini 2.0 Flash} \cite{google-gemini}: An optimized iteration in the Gemini series.
25
+ % \end{itemize}
26
+ % \paragraph{Mistral AI Family.}
27
+ % \begin{itemize}
28
+ % \item \textbf{Mistral (7B) Instruct v0.3} \cite{mistral-7b}: A 7-billion parameter instruction-tuned model from Mistral AI.
29
+ % \item \textbf{Mixtral-8x22B Instruct (141B)} \cite{mistral-mixtral}: A mixture of experts model with eight 22-billion parameter sub-models.
30
+ % \item \textbf{Mixtral-8x7B Instruct (46.7B)} \cite{mistral-mixtral}: A mixture of eight 7-billion parameter sub-models.
31
+ % \end{itemize}
32
+ % \paragraph{Qwen Family.}
33
+ % \begin{itemize}
34
+ % \item \textbf{Qwen 2 Instruct (72B)} \cite{qwen-2}: A 72-billion parameter instruction-following model.
35
+ % \item \textbf{Qwen QwQ-32B-Preview} \cite{qwen-qwq}: A 32-billion parameter preview model.
36
+ % \end{itemize}
37
+ % \paragraph{WizardLM Family.}
38
+ % \begin{itemize}
39
+ % \item \textbf{WizardLM-2 8x22B} \cite{wizardlm-2}: A mixture of eight 22-billion parameter sub-models optimized for instruction-following.
40
+ % \end{itemize}
41
+ % \paragraph{Jamba Family.}
42
+ % \begin{itemize}
43
+ % \item \textbf{Jamba 1.5 Mini} \cite{jamba-1.5}: A smaller version of AI21 Labs' Jamba 1.5.
44
+ % \item \textbf{Jamba 1.5 Large} \cite{jamba-1.5}: A larger variant of Jamba 1.5.
45
+ % \end{itemize}
46
+ % \paragraph{Claude Family.}
47
+ % \begin{itemize}
48
+ % \item \textbf{Claude 3.5 Sonnet} \cite{claude-3.5}: A version of Anthropic's Claude model optimized for advanced reasoning.
49
+ % \item \textbf{Claude 3 Haiku} \cite{claude-3}: A smaller, efficiency-focused model in the Claude series.
50
+ % \end{itemize}
51
+ % \paragraph{Cohere Family.}
52
+ % \begin{itemize}
53
+ % \item \textbf{Cohere Command R 7B} \cite{cohere-commandr}: A 7-billion parameter model focused on instruction-following.
54
+ % \item \textbf{Cohere Command R+} \cite{cohere-commandr}: An improved version of the Command R model.
55
+ % \end{itemize}
56
+ % \paragraph{OpenAI Family.}
57
+ % \begin{itemize}
58
+ % \item \textbf{GPT-4o} \cite{openai-gpt4o}: A highly advanced multimodal model from OpenAI.
59
+ % \item \textbf{OpenAI o1-mini} \cite{openai-o1}: A compact model from OpenAI, details remain limited.
60
+ % \end{itemize}
61
+ \include{content/tables/table_models_2}
FLaME/content/appendices/appendix_prompting.tex ADDED
@@ -0,0 +1,14 @@
1
+ \section{Prompting}\label{app:prompting}
2
+ In this section, we provide details on how we prompt foundation LMs for \papertitle evaluations.
3
+ \subsection{Formatting Test Instances}\label{sec:prompt-test}
4
+ \para{Language Model} For most \emph{language model} (LM) scenarios the prompt is simply the input, and there is no reference. If documents in LM datasets are longer than the model's window size, we tokenize documents using each model's corresponding tokenizer (if known), and segment the resulting token sequences according to the model's window size.\\
5
+ \para{Truncation.} For scenarios where test instances exceed a model's window size, we truncate the input to fit within the model's context window. This ensures consistency across different models without requiring reassembly of output fragments.\\
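+ As a minimal Python sketch of the segmentation and truncation logic described above (the helper names and the reserved-token budget are illustrative assumptions, not the released \papertitle code):
+ \begin{verbatim}
+ # Segment a tokenized LM document into windows, or truncate a test
+ # instance so the prompt fits within the model's context window.
+ def segment_by_window(tokens, window_size):
+     return [tokens[i:i + window_size]
+             for i in range(0, len(tokens), window_size)]
+
+ def truncate_to_window(tokens, window_size, reserved_for_output=256):
+     budget = max(window_size - reserved_for_output, 0)
+     return tokens[:budget]
+ \end{verbatim}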
6
+ \para{Multiple Choice.} For multiple choice scenarios, each instance consists of a question and several possible answer choices (typically with one marked as correct). Rather than asking an LM to directly predict the probability distribution over answer choices, we use a structured prompting approach for LM output.
7
+ We implement multiple-choice adaptation using the \textit{joint} approach \citep{Hendrycks2020-rz}, where all answer choices are concatenated with the question (e.g., ``\texttt{ A. <choice 1> B. <choice 2> Answer:}'') and the LM is prompted to respond with the correct or most probable answer. We default to using the joint approach unless other work has established a preferable method for a specific benchmark.
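+ As an illustrative sketch of the joint formatting (the function below is our own example, not the exact prompt template used in \papertitle):
+ \begin{verbatim}
+ # Concatenate all answer choices with the question and ask the LM
+ # for a single answer, following the joint multiple-choice approach.
+ def build_joint_mc_prompt(question, choices):
+     letters = [chr(ord("A") + i) for i in range(len(choices))]
+     options = " ".join(f"{l}. {c}" for l, c in zip(letters, choices))
+     return f"{question}\n{options}\nAnswer:"
+ \end{verbatim}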
8
+ \subsection{Formatting the Remainder of the Prompt}\label{sec:prompt-remainder}
9
+ \para{Prompt Construction.} LM prompts can also provide concise instructions or prefixes that clarify the expected model behavior. Recent work has thoroughly demonstrated that prompt design \textit{significantly} affects performance \citep{Le-Scao2021-uy, Wei2022-vi, Yao2023-zo, Besta2023-ft, Schulhoff2024-ar}. Rather than optimizing prompts to maximize performance \citep{Khattab2022-ag, Opsahl-Ong2024-zj, Yuksekgonul2024-yu, Schulhoff2024-ar}, we prioritize the use of naturalistic prompting to reflect realistic co-creative interactions between humans and computers \cite{Lin2023-qx, Lin2023-by}.\\
10
+ \subsection{Parameters}\label{sec:prompt-parameters}
11
+ Once the test instance (\refsec{prompt-test}) and prompt (\refsec{prompt-remainder}) are specified, we define the decoding parameters used to generate model completions. Examples of such parameters include the temperature value, specific stop tokens, and the number of completions.
12
+ \para{Temperature.} The temperature controls randomness in decoding: a temperature of $0$ corresponds to deterministic decoding, while a temperature of $1$ corresponds to probabilistic sampling from the model's distribution. We use temperature-scaling for scenarios requiring diverse outputs but set the temperature to zero for tasks demanding deterministic behavior (\ie classification tasks).\\
13
+ \para{Stop Token.} Aside from the LM-specific context-length limitations, we specify a stop condition through explicit stop sequences together with a maximum number of tokens to be generated. Stop sequences are preferred over stop-token ids for model-agnostic adaptation. We use a standardized max-token limit based on the expected length of the reply for each scenario to prevent excessive token generation during completion.\\
14
+ \para{Number of Outputs.} LM outputs are not stochastic under zero-temperature settings. For most scenarios, we use deterministic decoding (temperature $0$), and a single output per input suffices. However, for metrics and scenarios analyzing output distributions, we need to generate multiple outputs to gather a sufficient sample. By default, the number of outputs per input is $1$ for all of the initial evaluations done for \papertitle.
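+ For concreteness, a deterministic classification scenario might be configured as in the following sketch (field names follow common OpenAI-compatible request parameters; the specific values are illustrative, not the exact \papertitle defaults):
+ \begin{verbatim}
+ decoding_params = {
+     "temperature": 0.0,  # deterministic decoding for classification
+     "max_tokens": 64,    # capped by the expected reply length
+     "stop": ["\n\n"],    # model-agnostic stop sequence
+     "n": 1,              # one completion suffices at temperature 0
+ }
+ \end{verbatim}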
FLaME/content/appendices/appendix_relatedwork.tex ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \section{Related Work}\label{app:relatedwork}
2
+ Two early benchmarks for financial NLP are FLUE \citepFLUE and FLARE \citepFLARE. While they introduced multiple tasks (e.g., sentiment analysis, named entity recognition) relevant to financial contexts, they often focused on \emph{a limited set of datasets} and a \emph{single metric} for each task (e.g., F1 or accuracy). These suites did not formally acknowledge the \emph{incompleteness} of their coverage—neglecting many possible financial scenarios such as numerical QA, multi-step reasoning, or specialized regulatory text analysis. Additionally, they offered no standardized pipeline to evaluate \emph{foundation} LMs in a reproducible manner, instead often benchmarking only a few custom or fine-tuned models.
3
+ There are prior benchmarks for financial scenarios such as Golden Touchstone \citep{Wu2024-df}, CFBenchmark \cite{Lei2023-ox}, InvestorBench \cite{Li2024-fg}, BizBench \citep{Koncel-Kedziorski2023-fx}, and FinanceBench \cite{Islam2023-dl}, to name a few. These works often cover only a small handful of tasks without broad inference coverage, lack a holistic scenario-based taxonomy, or focus on a specialized and narrow task (\ie, financial question answering over tables). Other recent attempts \citeFinBen collect multiple financial datasets and occasionally implement limited software tooling for standardizing evaluations. However, several significant limitations remain:
4
+ \begin{itemize}
5
+ \item They do \emph{not} explicitly define \emph{holistic} methodologies akin to HELM, instead treating each dataset largely in isolation.
6
+ \item They typically rely on \emph{narrow} evaluation metrics (e.g., rule-based label extraction) that fail to capture the variety of ways a model can output correct information or demonstrate robust reasoning.
7
+ \item Many benchmarks focus on \emph{fine-tuned} models for specific tasks, rather than evaluating a broad range of \emph{foundation LMs} under standardized conditions.
8
+ \item They do not propose \emph{living} frameworks or a public leaderboard that invite ongoing community contributions.
9
+ \end{itemize}
10
+ For example, \citepFinBen provides a large collection of financial datasets bundled with a software package for model evaluation but does not address multi-metric scoring or unify the results consistently and transparently. The authors also do not define or adhere to explicit \emph{fair and open standards} for dataset selection, and they primarily focus on performance metrics that rely on simple rule-based matching of outputs. Hence, \citepFinBen \emph{never identifies its incompleteness} or encourages the broader community to fill those gaps.
11
+ These domain-specific benchmarks, including \citepFinBen, highlight a growing interest in finance-focused NLP but consistently fall short of fulfilling \emph{holistic} standards (see Table \ref{tab:us-vs-them}). They seldom perform multi-metric analysis, fail to account for the breadth of possible financial use cases, and rarely provide open-ended frameworks for ongoing updates. This gap becomes especially problematic as LMs are increasingly deployed in real-world financial settings, where mistakes can lead to high-impact consequences.
12
+ By comparison, our proposed \papertitle framework is the first for finance to satisfy all \textbf{three pillars} of holistic evaluation \dash (1) standardized evaluations, (2) multi-metric assessment, and (3) explicit recognition of incompleteness \cite{Liang2022-ew}. By releasing a \emph{living benchmark} complete with code, data curation, and a public leaderboard, we aim to \textbf{(i)} unify existing financial datasets under clear inclusion criteria, \textbf{(ii)} evaluate foundation LMs in a transparent and reproducible way, and \textbf{(iii)} foster an evolving ecosystem where researchers can collectively expand the benchmark to new tasks or languages over time.
FLaME/content/appendices/appendix_results.tex ADDED
@@ -0,0 +1,163 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \section{Results}\label{app:results}
2
+ \subsection{Extended Results}\label{app:extendedresults}
3
+ \begin{table*}[h!]
4
+ \centering
5
+ \resizebox{\textwidth}{!}{
6
+ \input{content/tables/by_task/text_classification}
7
+ }
8
+ \caption{Text Classification Table}
9
+ \label{tab:text_classification}
10
+ \end{table*}
11
+ \begin{table*}[h!]
12
+ \centering
13
+ \resizebox{\textwidth}{!}{
14
+ \input{content/tables/by_task/information_retrieval}
15
+ }
16
+ \caption{Information Retrieval Table}
17
+ \label{tab:information_retrieval}
18
+ \end{table*}
19
+ \begin{table*}[h!]
20
+ \centering
21
+ \resizebox{0.5\textwidth}{!}{
22
+ \input{content/tables/by_task/question_answering}
23
+ }
24
+ \caption{Question Answering Table}
25
+ \label{tab:question_answering}
26
+ \end{table*}
27
+ \begin{table*}[h!]
28
+ \centering
29
+ \resizebox{\textwidth}{!}{
30
+ \input{content/tables/by_task/sentiment_analysis}
31
+ }
32
+ \caption{Sentiment Analysis Table}
33
+ \label{tab:sentiment_analysis}
34
+ \end{table*}
35
+ \begin{table*}[h!]
36
+ \centering
37
+ \resizebox{\textwidth}{!}{
38
+ \input{content/tables/by_task/text_summarization}
39
+ }
40
+ \caption{Text Summarization Table}
41
+ \label{tab:text_summarization}
42
+ \end{table*}
43
+ \begin{table*}[h!]
44
+ \centering
45
+ \resizebox{\textwidth}{!}{
46
+ \input{content/tables/by_task/causal_analysis}
47
+ }
48
+ \caption{Causal Analysis Table}
49
+ \label{tab:causal_analysis}
50
+ \end{table*}
51
+ \begin{table*}[h!]
52
+ \centering
53
+ \resizebox{\textwidth}{!}{
54
+ \input{content/tables/table_taxonomy}
55
+ }
56
+ \caption{Financial NLP Datasets and Their Characteristics}
57
+ \label{tab:table_taxonomy}
58
+ \end{table*}
59
+ \subsection{Error Analysis}
60
+ \label{app:error-analysis}
61
+ This section provides additional insights into the common error types, data contamination concerns, prompt-design pitfalls, and other practical challenges encountered throughout our evaluations. We hope this deeper analysis will inform researchers and practitioners aiming to improve financial LM performance.\\
62
+ \para{Outdated or Degenerate Behavior (Llama\,2\,13B Chat).} During certain classification tasks, \textsc{Llama\,2\,13B} occasionally produces near-empty or trivial outputs (e.g., ``Sure.''), offering zero signal. Such degenerate behavior suggests possible corruption or misalignment in the fine-tuning stage. It also underscores that rechecking model versions, prompts, and tokens processed is essential. Due to this, we chose not to include Llama\,2\,13B Chat in our main results.\\
63
+ \para{Language Drift (Qwen\,2\,72B).} For summarization tasks in English, \textsc{Qwen\,2\,72B} often begins in English but drifts into Chinese partway through. This reflects the model’s large-scale Chinese pre-training, suggesting that strong domain and language priors can overshadow the instruction’s target language. Developers may mitigate this by adding stronger, repeated language constraints at the prompt level.\\
64
+ \para{Challenges in Causal Classification.} Nearly all models show limited success in identifying financial causal relationships. Such tasks require deeper textual comprehension (beyond keyword matching or shallow patterns) and domain-specific logic (e.g., linking interest rate hikes to bond price changes). Zero-shot in-context learning is typically insufficient for these complex, knowledge-intensive tasks. Future solutions may require structured knowledge bases or explicit symbolic reasoning modules.\\
65
+ \para{Summarization Nuances} Many LMs exhibit strong performance on extractive summarization tasks such as \textsc{ECTSum} and \textsc{EDTSum}, sometimes nearing 80--82\% by BERTScore. However, these scores may overestimate practical utility if the dataset is partially contained in a model’s pre-training data (\emph{data contamination}). In addition, summarization tasks with more abstractive demands or domain-specific jargon often see bigger drops in BERTScore, revealing model gaps in rephrasing and domain knowledge.\\
66
+ \para{Data Contamination and Overlaps.} We identify potential overlaps between publicly released financial datasets (\textsc{FinQA}, \textsc{TatQA}, \textsc{EDTSum}) and model pre-training corpora. When test examples leak into the training text, zero-shot performance metrics may be inflated, especially for large-scale public LMs. Mitigation strategies we suggest include: (i) curating new test sets from carefully \emph{time-split} corpora, (ii) deduplication of data used for LM training \textbf{\textit{or}} evaluation, and (iii) explicitly checking for exact or near-duplicate overlaps before final evaluation.\\
67
+ \para{Prompt Design Limitations.} Our prompt tuning was done on Llama\,3\,8B for cost reasons. While this improved performance on that specific model, it may not fully generalize to others. For instance, \emph{some} models handle extensive label sets better, while others fail to replicate the exact label formatting. In multiclass tasks like \textsc{Banking77}, LMs sometimes invent new labels or produce minor syntactic variations (\texttt{balance-not-updated} vs.\ \texttt{balance\_not\_updated}). Thorough prompt ablations, or per-model prompt adaptation, might reduce these inconsistencies but can be prohibitively expensive at scale.\\
68
+ \para{LMs and Numeric Regression} LMs tend to handle classification outputs better than continuous-valued regressions (e.g., sentiment scores in \textsc{FiQA} or percentage outputs in \textsc{FinQA}). Generating consistent numeric formats (precision, rounding, decimal vs.\ fraction) can be especially troublesome. We have partially addressed this by employing post-hoc normalization and approximate matching (e.g., ignoring minor decimal differences), but true numeric reliability remains a challenge. We use LM-as-a-Judge to resolve issues when they arise.\\
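+ As a minimal Python sketch of the post-hoc label normalization and tolerant numeric matching described above (our own illustration; the \papertitle evaluation code and its LM-as-a-Judge fallback are not reproduced here):
+ \begin{verbatim}
+ import re
+
+ def normalize_label(text):
+     # Collapse case and separator variants, e.g.
+     # "balance-not-updated" vs. "balance_not_updated".
+     return re.sub(r"[\s_\-]+", "_", text.strip().lower())
+
+ def numbers_match(pred, gold, rel_tol=1e-2):
+     # Ignore minor formatting/rounding differences in numeric outputs.
+     try:
+         p = float(pred.strip().rstrip("%").replace(",", ""))
+         g = float(gold.strip().rstrip("%").replace(",", ""))
+     except ValueError:
+         return normalize_label(pred) == normalize_label(gold)
+     return abs(p - g) <= rel_tol * max(1.0, abs(g))
+ \end{verbatim}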
69
+ \para{Differences Among QA Datasets.} \textsc{ConvFinQA} consistently yields worse performance than \textsc{FinQA}, attributed to multi-turn dialogues, more context switching, and additional reasoning steps. This indicates that each new layer of complexity (conversational vs.\ single-turn, tabular vs.\ textual, etc.) can drastically affect success rates.\\
70
+ \para{Efficiency and Cost Considerations.} Finally, we note that certain models incur substantially higher inference times when dealing with longer contexts (e.g., multi-hop QA or large label sets in classification). Although we do not report exhaustive speed benchmarks here, preliminary measurements show up to a 2$\times$ cost difference among similarly sized models. Such trade-offs imply that even if a model is more accurate in raw performance, real-world systems must balance these gains with practical resource limits.\\
71
+ \subsection{Results by Task Category}\label{app:task_results}
72
+ Below we discuss the results for the six major task categories, with references to the relevant performance tables in this appendix.
73
+
74
+ \subsubsection{Information Retrieval (IR)}
75
+ \label{sec:ir-results}
76
+
77
+ \noindent\textbf{Tasks:} \textsc{FiNER}, \textsc{FinRed}, \textsc{REFinD}, \textsc{FNXL}, and (partially) \textsc{FinEntity} focus on extracting or matching financial entities, relations, or numerals from textual documents.
78
+
79
+ \noindent\textbf{Findings:}
80
+ \begin{itemize}
81
+ \item \textbf{FiNER} sees \textbf{DeepSeek R1} in the lead with F1\,=\,0.807, followed by \textbf{DeepSeek-V3} (0.790) and \textbf{Claude 3.5} (0.799).
82
+ \item \textbf{FinRED} is topped by \textbf{Claude~3.5} at F1\,=\,0.439, whereas others typically score below 0.40.
83
+ \item \textbf{REFinD} is especially noteworthy: \textbf{DeepSeek R1} scores 0.952~F1, while \textbf{Google Gemini} (0.944) and \textbf{GPT-4} (0.942) also excel, demonstrating strong ability in relation extraction with high-quality model prompts.
84
+ \item \textbf{FNXL} remains very difficult: even the top model \textbf{DeepSeek R1} only achieves 0.057\,F1, illustrating that numeric labeling tasks in financial statements demand robust domain logic that few LLMs can capture in a simple prompting regime.
85
+ \end{itemize}
86
+
87
+ \subsubsection{Sentiment Analysis}
88
+ \label{sec:sentiment-results}
89
+
90
+ \noindent\textbf{Tasks:} \textsc{FiQA Task 1} (numeric regression of sentiment), \textsc{FinEntity} (entity-level sentiment), \textsc{SubjECTive-QA (SQA)}, and \textsc{Financial Phrase Bank (FPB)} cover various sentiment subtasks with different input styles (microblogs, annotated corpora, or paragraph-level context).
91
+
92
+ \noindent\textbf{Findings:}
93
+ \begin{itemize}
94
+ \item \textbf{FiQA Task~1} uses MSE.
95
+ \emph{Gemma~2~27B} is the most precise with 0.100~MSE, outdoing bigger models. \textbf{Claude~3.5} (0.101) and \textbf{Cohere~Command~R+} (0.106) follow closely.
96
+ \item \textbf{FPB} sees \textbf{Claude~3.5} scoring 0.944 (accuracy around 94.4\%)---the highest among all tested models. Notably, \textbf{Gemma~2~9B} is close at 0.940, reinforcing that specialized or well-tuned smaller models can challenge much larger ones.
97
+ \item \textbf{FinEntity} (when considered as a sentiment subtask) hits its best F1\,=\,0.662 via \textbf{OpenAI~o1-mini}, surpassing bigger models like Llama~3~70B or Claude~3.5.
98
+ \item \textbf{SubjECTive-QA} is topped by \textbf{Google Gemini} at F1\,=\,0.593, with \textbf{Jamba~1.5~Large} (0.582) also doing well, while many otherwise-strong systems lag behind in this domain-specific subjectivity measure.
99
+ \end{itemize}
100
+
101
+ \subsubsection{Causal Analysis}
102
+ \label{sec:causal-results}
103
+
104
+ \noindent\textbf{Tasks:} \textsc{Causal Detection (CD)} and \textsc{Causal Classification (CC)} measure whether models can identify cause--effect relationships in financial text.
105
+
106
+ \noindent\textbf{Findings:}
107
+ \begin{itemize}
108
+ \item \textbf{Causal Detection (CD)} is led by \textbf{DeepSeek~R1} (F1\,=\,0.337), though absolute scores remain low, with most models below 0.20\,F1. This highlights how purely parametric LLM knowledge may not suffice for nuanced causal cues in financial text.
109
+ \item \textbf{Causal Classification (CC)} sees the best result from \textbf{Mixtral-8x22B} at 0.308\,F1, while many are below 0.25.
110
+ \item Overall, both tasks remain \emph{harder} than simpler classification: even large 70B+ models remain around or under 0.30\,F1, suggesting a gap in robust causal reasoning under zero- or few-shot conditions.
111
+ \end{itemize}
112
+
113
+ \subsubsection{Text Classification}
114
+ \label{sec:classification-results}
115
+
116
+ \noindent\textbf{Tasks:} \textsc{Banking77 (B77)}, \textsc{FinBench (FB)}, \textsc{FOMC}, \textsc{Numclaim (NC)}, and \textsc{Headlines (HL)} collectively test domain-specific classification in finance---from bank queries to monetary policy stances, to short news headlines.
117
+
118
+ \noindent\textbf{Findings:}
119
+ \begin{itemize}
120
+ \item \textbf{Banking77} sees \textbf{DeepSeek~R1} leading with an F1 of 0.763, outpacing GPT-4 (0.710) and DeepSeek-V3 (0.714).
121
+ \item \textbf{FinBench} has an unexpected champion in \textbf{Jamba~1.5~Mini} (0.898~F1), even beating models far larger.
122
+ \item \textbf{FOMC} classification is best handled by \textbf{Claude~3.5} (0.674~F1), just ahead of DeepSeek~R1 (0.670).
123
+ \item \textbf{Numclaim} sees \textbf{GPT-4} on top at 0.750, with \textbf{OpenAI~o1-mini} second at 0.720.
124
+ \item \textbf{Headlines (HL)} is topped by \textbf{Gemma~2~9B} at 0.856, narrowly beating Google Gemini (0.837).
125
+ \end{itemize}
126
+
127
+ \subsubsection{Question Answering (QA)}
128
+ \label{sec:qa-results}
129
+
130
+ \noindent\textbf{Tasks:} \textsc{FinQA} (single-turn numeric QA), \textsc{ConvFinQA} (multi-turn), and \textsc{TATQA} (tabular/text hybrid).
131
+
132
+ \noindent\textbf{Findings:}
133
+ \begin{itemize}
134
+ \item \textbf{FinQA} is topped by \textbf{Claude~3.5} at 0.844 accuracy, with \textbf{DeepSeek-V3} next at 0.840, and GPT-4 + DeepSeek~R1 each at 0.836.
135
+ \item \textbf{ConvFinQA (CFQA)}, more demanding due to multi-turn context, is led by \textbf{DeepSeek~R1} at 0.853, while the second-best is \textbf{OpenAI~o1-mini} at 0.840. GPT-4 lags behind at 0.749, and many other models remain below 0.30.
136
+ \item \textbf{TATQA}, which fuses table and textual reading, also favors \textbf{DeepSeek~R1} (0.858), well above others such as QwQ-32B at 0.796 or GPT-4 at 0.754.
137
+ \end{itemize}
138
+
139
+ \subsubsection{Summarization}
140
+ \label{sec:summ-results}
141
+
142
+ \noindent\textbf{Tasks:} \textsc{ECTSum} (earnings-call transcripts) and \textsc{EDTSum} (financial news headlines) use BERTScore-based metrics.
143
+
144
+ \noindent\textbf{Findings:}
145
+ \begin{itemize}
146
+ \item \textbf{ECTSum} shows \textbf{Google Gemini} achieving the top BERTScore~F1 of 0.777, closely followed by GPT-4 (0.773) and Mixtral-8x22B (0.758).
147
+ \item \textbf{EDTSum} is led by \textbf{Jamba~1.5~Large} at 0.818, with a cluster of models at 0.815--0.817 (Gemma~2~9B, QwQ-32B, Google Gemini).
148
+ \item Overall, summarization tasks see higher absolute scores than more specialized tasks like numeric labeling.
149
+ \end{itemize}
150
+
151
+ \subsection{Efficiency and Cost Analysis}
152
+ \label{app:efficiency-analysis}
153
+
154
+ We calculated the cost of running each dataset and model using the saved inference results. This does not include evaluation costs; however, since those were all computed with Llama 3.1 8B, they should be far less variable than the inference costs across different providers and models.
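+ As a simple sketch of the underlying arithmetic (the prices and token counts below are placeholders; each provider quotes its own per-million-token rates):
+ \begin{verbatim}
+ # Cost of one inference call, with prices in USD per million tokens.
+ def request_cost_usd(prompt_tokens, completion_tokens,
+                      input_price_per_m, output_price_per_m):
+     return (prompt_tokens * input_price_per_m
+             + completion_tokens * output_price_per_m) / 1_000_000
+ \end{verbatim}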
155
+
156
+ \begin{table*}[h!]
157
+ \centering
158
+ \resizebox{\textwidth}{!}{
159
+ \input{content/tables/model_dataset_cost}
160
+ }
161
+ \caption{Cost Analysis Table. All prices listed in USD. SQA costs are an estimate based on known inputs and outputs, as the exact costs were not saved.}
162
+ \label{tab:cost_analysis}
163
+ \end{table*}
FLaME/content/appendices/appendix_taxonomy.tex ADDED
@@ -0,0 +1,85 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \section{Taxonomy of Financial Scenarios}\label{app:taxonomy}
2
+ \paragraph{Tasks.}
3
+ We focus on six core NLP tasks \dash \textit{question answering}, \textit{information retrieval}, \textit{summarization}, \textit{sentiment analysis}, \textit{toxicity detection}, and a \textit{text classification} category for miscellaneous labeling tasks. These tasks are \textit{user-facing} for finance: they reflect practical objectives like extracting key information from company filings, summarizing earnings reports, detecting false or harmful content in financial forums, and classifying transactions or documents. Although many sub-categories of tasks exist within each broad task category (\eg, named entity recognition, structured boundary detection, causal reasoning), we group them under broader categories where possible, to keep the focus on the end-user or enterprise-facing application in financial scenarios.
4
+ \paragraph{Domains.}
5
+ We define a domain by \textit{what} type of data it is, \textit{who} produced it, \textit{when} it was created, \textit{where} it originated, \textit{how} it was generated, and \textit{why} it is useful. Examples of domains include (i) \textit{publicly-traded corporations} producing investor filings, (ii) \textit{regulatory bodies} issuing policies and enforcement documents, (iii) \textit{news media} offering breaking market updates, (iv) \textit{SMBs} managing internal accounting ledgers, and (v) \textit{individual investors} discussing trades on social media. Each domain introduces unique formats (\eg, structured filings vs. informal posts) and unique constraints (\eg, legal compliance vs. personal expression). By taxonomizing these domains, researchers can use \papertitle to identify coverage gaps and propose new benchmark datasets for under-served financial scenarios.
6
+ \paragraph{What (Type of Data/Annotations).} This refers to the nature of the dataset, whether it includes structured financial records (e.g., SEC filings), informal text (e.g., social media discussions), regulatory reports, or analyst commentary. Annotations can range from human-labeled categories to machine-generated insights.
7
+ \paragraph{Who (Data Source).} The entity that produced or collected the dataset, such as individuals (personal finance data), businesses (corporate records), financial institutions (bank transactions), regulators (policy statements), or media sources (news articles).
8
+ \paragraph{Where (Data Origination \& Distribution).} The source repository of the dataset \dash e.g., regulatory databases, company websites, news platforms, or user-generated content from social media.
9
+ \paragraph{When (Time Sensitivity \& Temporal Scope).} The time period of the dataset, distinguishing between historical, recent, and real-time data. Financial data has strong temporal relevance, affecting its usability for different research tasks.
10
+ \paragraph{How (Data Generation \& Annotation).} Describes whether the dataset was self-reported, institutionally recorded, scraped from public sources, or generated synthetically. Annotation can be performed by experts, crowd workers, automated scripts, or AI models.
11
+ \subsection{Tasks}\label{app:tasks}
12
+ \paragraph{Question answering.}
13
+ In financial QA, models answer questions about company disclosures, regulatory text, or market data. For example, a user may ask, “What was Company X’s net income last quarter?” or “Under which clause must this fund disclose assets?” These tasks can be open-book (access to filings or transcripts) or closed-book (testing a model’s internalized domain knowledge). Accuracy and factual correctness are paramount, as erroneous answers can mislead analysts or investors.
14
+ \paragraph{Information retrieval.}
15
+ Here, the system locates relevant text or documents from large financial corpora, such as retrieving the correct section in an SEC filing that addresses a particular risk factor. This typically involves ranking passages or paragraphs by relevance. Good performance in financial IR helps analysts quickly navigate extensive disclosures, saving time and reducing information overload.
16
+ \paragraph{Summarization.}
17
+ Summaries condense lengthy financial documents like earnings reports or regulatory proposals into concise abstracts. Abstractive summarization can highlight key takeaways for investors, while extractive approaches ensure faithfulness to the original text. Faithfulness is critical in finance; hallucinated or misleading summaries can create compliance issues or misinform market participants.
18
+ \paragraph{Sentiment analysis.}
19
+ Sentiment tasks in finance often involve gauging the emotional tone of news headlines, social media chatter, or analyst commentary. Models can help traders or risk managers track public sentiment around specific stocks, detect shifts in market mood, or monitor customer feedback. Unlike general sentiment tasks, financial sentiment often leans heavily on domain-specific lexicons and context (\eg, “downward revision” vs. “positive guidance”).
20
+ \paragraph{Causal Analysis.}
21
+ Causal analysis in finance focuses on identifying cause-and-effect relationships within economic events, financial policies, or market movements. Models can help analysts determine whether a policy change influenced stock prices or assess the impact of macroeconomic factors on investment trends. Unlike general causal inference tasks, financial causal analysis often relies on structured data, temporal dependencies, and domain-specific knowledge (\eg, “interest rate hike leading to capital outflows” vs. “regulatory easing boosting market liquidity”).
22
+ \paragraph{Text classification.}
23
+ Beyond these core tasks, many finance-specific classification needs arise, such as identifying fraudulent activities (\eg, “phishing scam” vs. “legitimate inquiry”), labeling compliance documents by topic, or categorizing support tickets (\eg, “credit card issue” vs. “mortgage application”). This \emph{miscellaneous} category accommodates various text classification tasks at different granularity.
24
+
25
+ \subsection{Domains}\label{app:domains}
26
+ \subsection{What}\label{app:domain_what}
27
+ \textit{"What is the type of data/annotations?"}
28
+ \paragraph{Personal Finances.}
29
+ Personal finances include documents and records related to individual households’ finances. This category broadly covers self-generated financial records such as personal budgets, expense logs, cash flow statements, and official documents like individual income tax filings (\eg, IRS Form 1040). In addition, the category covers data collected about individuals by financial institutions, including bank statements, transaction logs, and credit reports. These data sources are used in various NLP tasks such as information extraction, summarization, sentiment analysis (\eg, for credit risk), and the generation of personalized financial advice. A clear distinction should be made between first-party data (directly produced or owned by individuals) and third-party data (collected about individuals by institutions), with derived data and metrics (\eg, credit reporting and scores) recognized as distinct types.
30
+ \paragraph{SMB Finances.}
31
+ Small and Medium Business (SMB) finances include the financial records generated and maintained by small enterprises. This category comprises internal documents such as accounting statements (balance sheets, income statements, and cash flow statements), invoices, payroll records, and business tax filings. It also encompasses external data collected about SMBs by financial institutions and credit bureaus, such as transaction logs and business credit reports. NLP applications for this data focus on information extraction, text classification, and summarization tasks. The category includes data produced directly by SMBs (first-party data) and data collected by third-party entities (external assessments).
32
+ \paragraph{Social Media \& Investor Forums.}
33
+ This includes content from public platforms where individual investors discuss financial markets. Social media posts are real-time and high-volume, often opinionated and informal (emojis, memes, humor, or hyperbole). Annotation often relies on crowd-sourcing of sentiment and toxicity labels. Examples of tasks include sentiment analysis, toxicity detection, text classification, and summarization. The category includes data (i.e., post text and image) produced directly by individuals (first-party), as well as data collected about the individual or their user behavior (third-party). % datasets include SemEval-2017 Task 5, FiQA-2018, and Reddit WallStreetBets archives.
34
+ \paragraph{Financial News \& Media.}
35
+ Produced by major news agencies, news about current events and finance informs markets about macroeconomics, company earnings, and opinionated analysis. News types range from real-time reports and market analyses to press releases. Financial news is high-frequency, continuously updated, and distributed via news terminals, APIs, and web sources. Annotations can include topic categories, sentiment scores, and event classifications. NLP tasks include information retrieval, text classification, sentiment analysis, and summarization. % datasets include Reuters RCV1/RCV2, Financial PhraseBank, and SemEval-2017 Task 5.
36
+ \paragraph{Corporate Disclosures \& Filings.}
37
+ Corporate disclosures include financial reports such as 10-K annual reports, 10-Q quarterly reports, earnings call transcripts, and press releases. These documents are produced by public corporations, primarily for legal compliance, investor transparency, and shaping market sentiment. They consist of formal reports, earnings transcripts, and event-driven disclosures. The frequency varies, with periodic reports released annually or quarterly and event-driven disclosures appearing as needed. Creation follows regulatory formats, typically unannotated, but some datasets add expert labels for sentiment analysis and summarization. Distribution occurs through company websites, regulatory databases, and press release services. Example tasks include summarization, information extraction, sentiment analysis, and question-answering.
38
+ % datasets include FinQA \cite{Chen2021-hr}, DocFinQA \cite{Reddy2024-gs}, SubjECTive-QA \citeTODO{Pardawala2024-lj}, and others.
39
+ \paragraph{Regulatory \& Legal Disclosures.}
40
+ This includes regulatory filings, policy statements, legislation, and central bank reports. Producers include financial regulators, central banks, and legislative bodies, aiming to ensure transparency, market regulation, and compliance guidance. These texts range from proposed rules and legislation to policy statements and enforcement actions, with varying publication frequency. Regulatory texts are formal and often lengthy, with limited public annotation. NLP tasks include text classification, summarization, information extraction, and stance detection.
41
+ % datasets include FinBen, Central Bank Statements Corpus, and industry reports from the BIS and IMF.
42
+ \paragraph{Analyst \& Research Reports.}
43
+ These reports are created by investment banks, rating agencies, and independent analysts to provide in-depth financial analysis and recommendations. They include equity research reports, macroeconomic outlooks, and credit rating evaluations, which are published periodically and are event-driven. Reports are proprietary, limiting public access, though some analyst reports appear in regulatory filings. NLP tasks include sentiment analysis, recommendation classification, summarization, and information extraction.
44
+ % datasets include the Analyst Report Corpus and EDGAR Analyst Reports.
45
+ \paragraph{Emerging \& Alternative Finance.} This category includes cryptocurrency whitepapers, FinTech credit reporting data, and novel forms of financial products. Data producers range from blockchain communities to financial regulators. Alternative data is diverse in format and frequency. NLP tasks include entity recognition, scam detection, summarization, and bias analysis.
46
+ % datasets include Crypto Whitepaper Corpus, CoinDesk Headlines, and CFPB Consumer Complaints.
47
+ \subsection{Who}\label{app:domain_who}
48
+ \textit{"Who generated the data/annotations?"}
49
+ \paragraph{Individuals \& Households.}
50
+ This category covers the financial data originating from individuals' activity. It includes self-generated financial records (such as budgets, expenses, and receipts) and data produced by financial institutions on behalf of individuals (bank statements, loan documents, etc.).
51
+ \paragraph{Small and Medium Businesses (SMBs).}
52
+ This category pertains to the financial data produced by SMBs. It involves internally generated documents such as accounting records, invoices, payroll information, and tax filings, alongside externally collected data like business credit reports and bank transaction records. NLP systems may use this data to automate financial management tasks, improve risk assessments, and facilitate credit underwriting for smaller enterprises. Differentiations are made between first-party data (generated by the SMB) and third-party data (collected about the SMB).
53
+ \paragraph{Commercial \& Retail Banks.}
54
+ Banking institutions accept deposits, extend credit, and provide loans to consumers and businesses. Larger banks have lines of business that include retail banking (i.e., individual customers), business banking (small and medium companies), and commercial banking (enterprise clients) operations. They generate extensive text-based data, including annual reports, quarterly earnings reports, and shareholder letters. Regulatory reports such as SEC 10-K/10-Q forms disclose financials and risks. Internally, banks maintain risk management reports, compliance documents, and customer communications (emails, chat logs). Most internal documents are proprietary, while investor reports and required filings are public.
55
+ \paragraph{Investment Banks \& Brokerage Firms.}
56
+ Investment banks facilitate securities offerings, mergers and acquisitions, and other complex financial transactions. Brokerage firms execute trades for clients. These institutions produce financial research reports, prospectuses, and offering memoranda for investment offerings. Internally, they generate pitch books, trading desk reports, and compliance documentation. Public documents include financial research and regulatory filings, while deal-related and internal reports remain proprietary.
57
+ \paragraph{Asset Management Firms.}
58
+ Asset managers invest pooled funds on behalf of clients, including mutual funds, pension funds, and investment advisors. They produce fund prospectuses, shareholder reports, investor letters, and market outlooks. Internally, they maintain investment committee memos, research reports, and risk reports. Public mutual fund documents and investor letters are available, whereas internal research and risk memos usually remain confidential.
59
+ \paragraph{Hedge Funds \& Private Investment Firms.}
60
+ Hedge funds and private investment firms manage private capital with flexible investment strategies. They produce strategy documents, trading models, and investor update letters. Capital-raising documents such as Private Placement Memoranda (PPM) outline strategies, risks, and terms. Regulatory filings like Form 13F are public, but trading strategies and internal risk/compliance reports remain confidential.
61
+ \paragraph{Insurance Companies.}
62
+ Insurance firms underwrite risk policies and manage significant investment portfolios. They generate insurance policy contracts, actuarial reports, claims reports, and risk assessments. Regulatory filings include financial statements and risk-based capital reports. Public documents include policies and financial reports, whereas underwriting guidelines and claims analyses remain proprietary.
63
+ \paragraph{Regulators \& Central Banks.}
64
+ Regulators oversee financial markets, ensuring stability and compliance. Examples include the Security Exchange Commission (SEC), the Federal Reserve, the Basel Committee on Banking Supervision, and the European Central Bank. These entities produce regulations and guidance documents, monetary policy statements, financial stability reports, and enforcement rulings. Many regulatory texts are public, though supervisory communications and compliance assessments remain private.
65
+ \paragraph{Government Finance Departments.}
66
+ Finance ministries manage government fiscal policy and economic regulation. They produce budget statements, policy white papers, press releases, and financial analysis reports. Most documents are public, though some internal memos and briefings remain confidential.
67
+ \paragraph{Financial Technology Companies.}
68
+ Financial Technology companies (FinTech) engage in financial services innovation through technology, including digital banking, AI agents, investment technologies, cryptocurrency exchanges, and others. They produce customer agreements, product documentation, and white papers. Some FinTechs generate regulatory filings and compliance reports. Customer-facing documents are typically public, while internal reports and transaction logs remain private.
69
+ \paragraph{Legal \& Compliance Bodies.}
70
+ These entities ensure regulatory adherence and oversee legal aspects of finance. They generate compliance manuals and audit reports (i.e., Suspicious Activity Reports) and publish legal advisories. While many compliance documents remain internal, some client advisories and industry guidelines are publicly available.
71
+ \subsection{Where}\label{app:domain_where}
72
+ \textit{"Where was the data generated/annotated?"}
73
+ In finance, textual data arises from multiple channels. Corporate disclosures are uploaded to regulatory databases (\eg, the SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR)), press releases appear on news-wires or company websites, and social media data is generated globally. Annotation can be handled by specialized providers (\eg, rating agencies for risk labeling) or crowd-sourced platforms. Consequently, the “where” dimension includes the physical location of data creators or annotators and the digital repositories hosting the final datasets (\eg, regulatory websites, aggregator platforms, or data brokers).
74
+ \subsection{When}\label{sec:taxonomy:when}
75
+ \textit{"When was the data generated/annotated?"}
76
+ Finance is \textbf{time-sensitive}. Data from an older annual report (\eg, 2010) may be of historical research value, while a live earnings call is relevant to immediate trading decisions. Datasets can be further divided into \emph{historical}, \emph{recent}, or \emph{live-streaming} categories. The time period also affects legal obligations (\eg, updated regulations), context relevance (macroeconomic conditions), and any potential dataset drift over time (\eg, new financial terminology, products, services).
77
+ % \subsubsection{Why}\label{sec:taxonomy:why}
78
+ % \textit{"Why would the data/annotations be used?"}
79
+ % In finance, the motivations range from legal compliance (meeting regulatory disclosure requirements) to investor relations (transparency for shareholders) or internal risk management (spotting financial misconduct). Data often enables specific downstream applications—like building credit-scoring models or automating customer support. Understanding “why” data is created or used helps identify nuances in the data (\eg, self-reported vs. legally mandated) and the real-world implications for any NLP-driven downstream uses. We consider the real-world uses of benchmark datasets during categorization or metric selections. For example, data sets related to anti-money laundering focus on text classification to detect fraud and might prioritize recall to catch potential wrongdoing. In contrast, a financial analyst focuses on text classification for document classification.
80
+ \subsection{How}\label{app:domain_how}
81
+ \textit{"How was the data generated/annotated?"}
82
+ Financial data generation spans official reporting (formal documentation mandated by regulations) and user-generated content (social media, customer chats). Annotation might be done by subject matter experts (\eg, compliance officers labeling risk factors), professional analysts (\eg, rating agencies), crowd workers (\eg, annotator labeling), or machines (\eg, AI labeling services). The expertise needed often correlates with the data’s complexity—highly technical documents (\eg, derivative contracts) demand specialized annotators to ensure label accuracy. Annotations may be partially or fully automated, leveraging pattern-matching or prior language models to reduce costs.
83
+ \subsection{Language}\label{app:language}
84
+ \textit{"Language used for data/annotations?"}
85
+ Currently, \papertitle focuses on \textbf{English}, reflecting its widespread use in global financial markets and regulatory documents. However, finance also includes other major world languages for company disclosures, investor communications, and cross-border transactions. Future expansions may incorporate multilingual corpora to reflect cross-national markets better. For now, we emphasize that language coverage remains incomplete and is a major area for community-driven growth.
FLaME/content/datasets/banking77.tex ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ \textbf{Banking77 (B77)} \cite{Casanueva2020-oa} is a fine-grained dataset designed for intent detection within the banking domain. It comprises 13,083 customer service queries annotated with 77 unique intents, such as card\_arrival and lost\_or\_stolen\_card. The dataset focuses on single-domain intent classification, providing a granular view of customer queries in the banking sector. With 10,003 training and 3,080 test examples, Banking77 offers a valuable resource for evaluating machine learning models in intent detection. The dataset has been curated to fill the gap in existing intent detection datasets, which often feature fewer intents or cover multiple domains without the depth offered here. The Banking77 dataset is publicly available under the \textbf{MIT License}.
2
+ % \citet{Ying2022-fo} investigates potential labeling errors in Banking77...
FLaME/content/datasets/convfinqa.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{ConvFinQA (CFQA)} \cite{Chen2022-ae} is a large-scale multi-turn question-answering dataset designed to explore the chain of numerical reasoning in conversational question answering within the financial domain. It consists of 3,892 conversations and 14,115 questions, where the conversations are split between 2,715 simple and 1,177 hybrid conversations. ConvFinQA focuses on modeling complex, long-range numerical reasoning paths found in real-world financial dialogues. The dataset is a response to the growing need to study complex reasoning beyond pattern matching, and it includes experiments with neural symbolic and prompting-based methods to analyze reasoning mechanisms. This resource pushes the boundaries of research on numerical reasoning and conversational question-answering in finance. The ConvFinQA dataset is released under the \textbf{MIT License.}
FLaME/content/datasets/ectsum.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{ECTSum} \cite{Mukherjee2022-cj} is designed for bullet-point summarization of long earnings call transcripts (ECTs) in the financial domain. It consists of 2,425 document-summary pairs, with the transcripts sourced from publicly traded companies' earnings calls between January 2019 and April 2022. Each transcript is a lengthy, unstructured document, and the summaries are concise, telegram-style bullet points extracted from Reuters articles. These summaries focus on key financial metrics such as earnings, sales, and trends discussed during the calls. ECTSum addresses the challenge of summarizing complex financial data into short, meaningful summaries, making it a valuable benchmark for evaluating summarization models, particularly in the context of financial reporting. The ECTSum dataset is released under the \textbf{GPL-3.0 license}.
FLaME/content/datasets/edtsum.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{EDTSum} \cite{Xie2024-pn} is a financial news summarization resource designed to evaluate the performance of large language models (LLMs) in generating concise and informative summaries. It comprises 2,000 financial news articles, each paired with its headline serving as the ground-truth summary. These articles were manually selected and cleaned from the dataset introduced by \cite{Zhou2021-il} to ensure high-quality annotations. The original dataset \cite{Zhou2021-il} focuses on corporate event detection and text-based stock prediction, containing 9,721 news articles with token-level event labels and 303,893 first-hand news articles with minute-level timestamps and comprehensive stock price labels. The EDTSum dataset provides a benchmark for financial text summarization. The EDTSum dataset is \textbf{publicly available}.
FLaME/content/datasets/finbench.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{FinBench (FB)} \cite{Yin2023-nf} is a dataset designed to evaluate the performance of machine learning models using both tabular data and profile text inputs, specifically within the context of financial risk prediction. The FinBench dataset consists of approximately 333,000 labeled instances, covering three primary financial risks: default, fraud, and churn. Each instance is labeled as ``high risk'' or ``low risk''. The time frame of data collection varies by dataset. The dataset accompanies FinPT, an approach that leverages Profile Tuning using foundation LMs. The core task is to transform tabular data into natural-language customer profiles via LMs for enhanced prediction accuracy. This benchmark falls under financial risk prediction. The FinBench dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International \textbf{(CC BY-NC 4.0) license}.
FLaME/content/datasets/fincausal.tex ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \textbf{FinCausal-SC} \cite{Mariko2020-by} is a dataset for cause-effect analysis in financial news texts. It consists of 29,444 text sections (each containing up to three sentences), with 2,136 annotated as causal and accompanied by cause-effect spans.
2
+ FinCausal focuses on two tasks:
3
+
4
+ \textbf{(1) Causality Classification (CC).}
5
+ Determine if a given text section contains a causal relation. Each text section is labeled with
6
+ Gold = 1 if a causal statement is present and 0 otherwise.
7
+
8
+ \textbf{(2) Causality Detection (CD).}
9
+ For those text sections identified as causal, the task is to extract the Cause and Effect spans. In total, there are 796 instances annotated for cause-effect extraction. These include both unicausal cases (with an average of 621.67 instances) and multicausal cases (with an average of 174.33 instances). This task challenges models to handle potentially complex causal chains, where one event can trigger multiple consequences or multiple factors can lead to a single outcome.
10
+
11
+ FinCausal-SC pushes beyond simple keyword matching toward more nuanced and context-aware understanding of financial news articles. This dataset is published under the \textbf{CC0 License}.
FLaME/content/datasets/finentity.tex ADDED
@@ -0,0 +1,55 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \textbf{FinEntity (FE)} \cite{Tang2023-sm} is an entity-level sentiment classification dataset designed for financial news analysis. It contains 979 financial news paragraphs, featuring 2,131 manually-annotated financial entities classified into positive, negative, and neutral sentiment categories. The dataset was sourced from Refinitiv Reuters Database, ensuring high-quality financial news coverage. Data collection focused on financial entities such as companies, organizations, and asset classes, excluding persons, locations, and events. The dataset employs a BILOU labeling scheme for entity tagging and sentiment classification. Fine-tuned BERT and FinBERT models significantly outperform ChatGPT in this task. Additionally, the FinEntity dataset has been applied to cryptocurrency news (15,290 articles from May 2022 to February 2023), demonstrating stronger correlations between entity-level sentiment and cryptocurrency prices compared to traditional sequence-level sentiment models. The FinEntity dataset is licensed under the Open Data Commons Attribution License \textbf{(ODC-BY) license.}
2
+
3
+
4
+ Previous work on FinEntity, such as \cite{Xing2025-qo},
5
+ focuses on sentiment classification and does not account for entity extraction in the same manner. Specifically, prior approaches often introduce random insertions to handle unclear or irrelevant outputs, which is not applicable to our evaluation setting where exact entity matching is also considered.
6
+
7
+
8
+ The FinEntity task involves entity extraction and sentiment classification. For our evaluations, span boundary detection is not considered. This evaluation metric treats outputs as sets rather than enforcing exact span alignment.
9
+
10
+ \textbf{Entity-Based Comparison}
11
+
12
+ Given the predicted and ground-truth entity sets:
13
+
14
+ \[
+ E_p = \{e_{p_1}, e_{p_2}, \ldots, e_{p_{N_p}}\}, \qquad
+ E_t = \{e_{t_1}, e_{t_2}, \ldots, e_{t_{N_t}}\}.
+ \]
21
+
22
+ Each entity \( e \) is represented as:
23
+
24
+ \[
25
+ e = (\text{value},\, \text{tag},\, \text{label}).
26
+ \]
27
+
28
+ An entity in the predicted set is considered a match if it exactly equals any ground-truth entity:
29
+
30
+ \[
31
+ M = \{ e \in E_p : e \in E_t \}.
32
+ \]
33
+
34
+ \textbf{Proposed Evaluation Metric}
35
+
36
+ We compute:
37
+
38
+ \[
+ P = \frac{|M|}{|E_p|}, \qquad
+ R = \frac{|M|}{|E_t|}, \qquad
+ F1 = \frac{2 P R}{P + R}, \qquad
+ \text{Accuracy} = \frac{|M|}{|E_t|}.
+ \]
53
+
54
+ Since our evaluation is entity-level, accuracy is equivalent to recall. Unlike prior work that enforces strict length matching, we adopt a more flexible metric to better align with the nature of LLM outputs. This allows for partial credit and avoids assigning a score of zero when predictions differ in length from the ground truth.
55
+
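+ As a minimal Python sketch of this entity-level set matching (our own illustration; entities are assumed to already be normalized (value, tag, label) tuples):
+ \begin{verbatim}
+ def entity_scores(predicted, gold):
+     # Exact tuple matches between prediction and reference sets;
+     # no span-boundary alignment is required.
+     E_p, E_t = set(predicted), set(gold)
+     matches = E_p & E_t
+     precision = len(matches) / len(E_p) if E_p else 0.0
+     recall = len(matches) / len(E_t) if E_t else 0.0
+     f1 = (2 * precision * recall / (precision + recall)
+           if precision + recall else 0.0)
+     # Entity-level accuracy is equivalent to recall in this setting.
+     return {"precision": precision, "recall": recall,
+             "f1": f1, "accuracy": recall}
+ \end{verbatim}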
FLaME/content/datasets/finer.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{FiNER-Open Research Dataset (FiNER-ORD)} \cite{Shah2023-hr} is a manually annotated dataset comprising 47,851 financial news articles (in English) collected from webz.io. Each article is a JSON document containing metadata such as the source, publication date, author, and title. A subset of 220 randomly sampled documents was manually annotated, with 201 remaining after filtering out empty articles. The dataset was manually labeled using Doccano, an open-source annotation tool, with annotations for person (PER), location (LOC), and organization (ORG) entities. This annotated dataset benchmarks model performance for financial named entity recognition. Further annotation guidelines are available in the dataset's documentation. The main metric used for evaluations of the models for the FiNER-ORD dataset is Macro F1. The FiNER-Open Research Dataset (FiNER-ORD) is available under the Creative Commons Attribution-NonCommercial 4.0 International \textbf{(CC BY-NC 4.0) license }.
FLaME/content/datasets/finqa.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{FinQA} \cite{Chen2021-hr} is a large-scale dataset designed for numerical reasoning over financial data, consisting of 8,281 question-answer pairs derived from financial reports authored by experts. The dataset addresses the complexity of analyzing financial statements, which requires both deep understanding and intricate numerical reasoning. Unlike general QA tasks, FinQA focuses on questions that demand the interpretation of financial data and multi-step reasoning to reach an answer. The dataset is fully annotated with reasoning programs to ensure explainability, making it a valuable resource for advancing research in automated financial analysis. The FinQA dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International \textbf{(CC BY-NC 4.0) license.}
FLaME/content/datasets/finred.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{FinRED (FR)} \cite{Sharma2022-dt} dataset is a specialized relation extraction dataset tailored to the financial domain, created to address the gap where existing models trained on general datasets fail to transfer effectively to financial contexts. It comprises data curated from financial news and earnings call transcripts, with financial relations mapped using a distance supervision method based on Wikidata triplets. To ensure robust evaluation, the test data is manually annotated. The dataset provides a benchmark for evaluating relation extraction models, revealing a significant performance drop when applied to financial relations, highlighting the need for more advanced models in this domain. The FinRED dataset is released under the Creative Commons Attribution 4.0 International \textbf{(CC BY 4.0) license.}
FLaME/content/datasets/fiqa.tex ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ \textbf{FiQA} \cite{Maia2018-hg} has two subtasks. \textbf{\textit{FiQA Task 1}} focuses on aspect-based financial sentiment analysis. Given a financial text, such as microblog posts or news headlines, systems are tasked with identifying the specific target aspects mentioned and predicting their corresponding sentiment scores on a continuous scale from -1 (negative) to 1 (positive). The challenge involves accurately linking financial entities or topics to the appropriate sentiment, such as distinguishing between corporate strategy decisions of companies. For evaluation, systems are measured on their ability to correctly classify aspects, attach sentiment to those aspects, and predict sentiment with metrics like precision, recall, F1-score, and regression-based measures (MSE and R-squared).
2
+ \textbf{\textit{FiQA Task 2}} addresses opinion-based question answering (QA) over financial data, where systems must answer natural language questions by retrieving relevant financial opinions and facts from a knowledge base of structured and unstructured documents (such as reports, news, and microblogs). This task requires systems to either rank relevant documents from the knowledge base or generate answers directly. Opinion-based questions require identifying entities, aspects, sentiment, and opinion holders, with performance evaluated on metrics like F-score, Normalized Discounted Cumulative Gain (NDCG), and Mean Reciprocal Rank (MRR). The QA test collection includes diverse sources like StackExchange, Reddit, and StockTwits, focusing on ranking and answering accuracy.
FLaME/content/datasets/fnxl.tex ADDED
@@ -0,0 +1,49 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \textbf{The Financial Numeric Extreme Labeling (FNXL) dataset} \cite{Sharma2023-ir} addresses the challenge of automating the annotation of numerals in financial statements with appropriate labels from a vast taxonomy. Sourced from the U.S. Securities and Exchange Commission's (SEC) publicly available annual 10-K reports from 2019 to 2021, the FNXL dataset comprises 79,088 sentences containing 142,922 annotated numerals, categorized under 2,794 distinct labels.
2
+
3
+ The FNXL task involves extracting numerical values associated with specific XBRL tags. Unlike traditional named entity recognition, this task requires set-based numerical comparison. Thus, we cannot use Entity F1 scores directly.
4
+
5
+ Normalization (case standardization and whitespace stripping) is applied consistently across all datasets to reduce surface-level inconsistencies; we do not define dataset-specific normalization rules.
6
+
7
+ \textbf{Set-Based Comparison and Partial Credit}
8
+
9
+ Each tag is associated with a set of numerical values, and we evaluate based on set overlap rather than exact string matching. Given the predicted and ground-truth mappings:
10
+
11
+ $
12
+ T_p = \{(t_p, S_p)\}
13
+ $
14
+
15
+ $
16
+ T_t = \{(t_t, S_t)\},
17
+ $
18
+
19
+ where \( S_p \) and \( S_t \) are sets of numerical values, we compute:
20
+
21
+
22
+ $M_t = S_p \cap S_t,$ \\
23
+ $TP = \sum_{t} |M_t|$, \\
24
+ $FP = \sum_{t} |S_p - M_t|$, \\
25
+ $FN = \sum_{t} |S_t - M_t|.$
26
+
27
+
28
+ The total actual and predicted values are given by:
29
+
30
+ $
31
+ \text{Total}_{\text{actual}} = \sum_{t} |S_t|
32
+ $
33
+
34
+ $
35
+ \text{Total}_{\text{predicted}} = \sum_{t} |S_p|.
36
+ $
37
+
38
+ \textbf{Evaluation Metrics}
39
+
40
+ We compute precision, recall, and F1 score using standard formulae.
41
+
42
+ Additionally, we define a Jaccard-inspired accuracy measure:
43
+
44
+ $
45
+ \text{Accuracy} = \frac{TP}{\text{Total}_{\text{actual}} + \text{Total}_{\text{predicted}} - TP}
46
+ $
47
+
48
+
49
+ This evaluation metric allows for partial credit by considering numerical overlaps instead of enforcing exact matches, which is crucial given the nature of LLM predictions.
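
To make the set-based scoring above concrete, here is a minimal Python sketch of the computation, assuming predictions and gold labels are given as dictionaries mapping each XBRL tag to a set of numeral strings. The helper names (`normalize`, `set_based_scores`) and the example tags are illustrative only and are not taken from the FLaME codebase.

```python
# Illustrative sketch (not from the FLaME repository) of the FNXL set-based scoring described above.

def normalize(value: str) -> str:
    """Normalization as described above: case standardization and whitespace stripping."""
    return value.strip().lower()


def set_based_scores(predicted: dict[str, set[str]], actual: dict[str, set[str]]) -> dict[str, float]:
    """Set-based precision, recall, F1, and the Jaccard-inspired accuracy.

    Both arguments map an XBRL tag t to its set of numeral strings
    (S_p and S_t in the notation above).
    """
    pred_norm = {t: {normalize(v) for v in s} for t, s in predicted.items()}
    gold_norm = {t: {normalize(v) for v in s} for t, s in actual.items()}

    total_predicted = sum(len(s) for s in pred_norm.values())  # Total_predicted = sum_t |S_p|
    total_actual = sum(len(s) for s in gold_norm.values())     # Total_actual    = sum_t |S_t|

    tp = fp = fn = 0
    for tag in set(pred_norm) | set(gold_norm):
        s_p = pred_norm.get(tag, set())
        s_t = gold_norm.get(tag, set())
        m_t = s_p & s_t          # M_t = S_p intersect S_t
        tp += len(m_t)           # TP = sum_t |M_t|
        fp += len(s_p - m_t)     # FP = sum_t |S_p \ M_t|
        fn += len(s_t - m_t)     # FN = sum_t |S_t \ M_t|

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    denom = total_actual + total_predicted - tp
    accuracy = tp / denom if denom else 0.0  # TP / (Total_actual + Total_predicted - TP)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Example: one correct numeral, one spurious numeral, and one tag missed entirely.
scores = set_based_scores(
    predicted={"us-gaap:Revenues": {"3,400", "9.9"}},
    actual={"us-gaap:Revenues": {"3,400"}, "us-gaap:NetIncomeLoss": {"1.2"}},
)
# TP=1, FP=1, FN=1 -> precision=0.5, recall=0.5, f1=0.5, accuracy = 1 / (2 + 2 - 1) = 0.333...
print(scores)
```

In the example, the spurious "9.9" counts as a false positive and the missed "1.2" as a false negative, while the matched "3,400" still earns partial credit toward the Jaccard-style accuracy, which is the behavior the metric is designed to capture.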
FLaME/content/datasets/fomc.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ The \textbf{Federal Open Market Committee (FOMC)} \cite{Shah2023-bh} dataset is a large-scale, tokenized, and annotated dataset designed to analyze the impact of monetary policy announcements on financial markets. It comprises FOMC speeches, meeting minutes, and press conference transcripts collected from 1996 to 2022. The dataset introduces a novel hawkish-dovish classification task, where the goal is to classify the stance of FOMC communications as hawkish (policy tightening), dovish (policy easing), or neutral. Each document is accompanied by metadata, including the speaker and publication date. The dataset was curated using both rule-based methods and manual annotation and has been benchmarked with state-of-the-art pre-trained models such as RoBERTa and BERT. It aims to provide a resource for understanding how FOMC communications influence financial markets, including stocks and treasury yields. The Federal Open Market Committee (FOMC) dataset is publicly available under the Creative Commons Attribution-NonCommercial 4.0 International \textbf{(CC BY-NC 4.0) license.}
FLaME/content/datasets/fpb.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ The \textbf{Financial Phrase Bank (FPB)} \cite{Malo2013-el} is a dataset for sentiment analysis of financial news. It contains 4,840 sentences sourced from English-language financial news articles, each categorized as positive, negative, or neutral. Each sentence reflects the sentiment an investor might perceive from the news with respect to its likely influence on stock prices. The dataset was annotated by a group of 16 annotators with a background in finance, using a majority-vote approach, and is available in four configurations based on annotator agreement levels (50\%, 66\%, 75\%, and 100\%). FPB is widely used for financial sentiment analysis, especially for training and benchmarking models in the financial domain. The Financial Phrase Bank (FPB) dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported \textbf{(CC BY-NC-SA 3.0) License}.
FLaME/content/datasets/headlines.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ The News \textbf{Headline (HL)} Classification \cite{Sinha2021-ax} dataset consists of 11,412 human-annotated financial news headlines focused on commodities, particularly gold, collected from 2000 to 2019. Each entry includes the publication date, article URL, and headline text, along with binary indicators capturing key financial aspects: whether the headline mentions a price, the direction of price movement, and references to past or future prices and news. The dataset is valuable for analyzing sentiment and market trends based on news articles, making it a useful resource for financial analysis, trading strategy development, and research on sentiment analysis within the financial domain. The News Headline Classification dataset is licensed under the Creative Commons Attribution-ShareAlike 3.0 \textbf{(CC BY-SA 3.0) license.}
FLaME/content/datasets/numclaim.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{Numerical Claim Detection Dataset (NC)} \cite{Shah2024-cr} is an expert-annotated dataset designed for detecting fine-grained investor claims within financial narratives, with a focus on the role of numerals. The dataset was constructed by sampling and annotating financial-numeric sentences from a large collection of 87,536 analyst reports (2017–2020) and 1,085 earnings call transcripts (2017–2023). Specifically, 96 analyst reports (two per sector per year) were sampled, containing 2,681 unique financial-numeric sentences, alongside 12 randomly selected earnings call transcripts (two per year), contributing 498 additional financial-numeric sentences. Each sentence was manually labeled as either "In-claim" or "Out-of-claim" by two annotators with foundational expertise in finance, ensuring high-quality annotations. This dataset facilitates the study of numerical claim detection in financial discourse and serves as a resource for argument mining and investor sentiment analysis. The Numerical Claim Detection dataset is licensed under the Creative Commons Attribution 4.0 International \textbf{(CC BY 4.0) license}.
FLaME/content/datasets/refind.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{REFinD (RD)} \cite{Kaur2023-we} is a specialized relation extraction dataset created to address the unique challenges of extracting relationships between entity pairs from financial texts. With approximately 29,000 annotated instances and 22 distinct relations across 8 types of entity pairs, it stands out as the largest-scale dataset of its kind, specifically generated from financial documents, including Securities and Exchange Commission (SEC) filings. This dataset aims to fill the gap left by existing relation extraction datasets, which are predominantly compiled from general sources like Wikipedia or news articles. The REFinD dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International \textbf{(CC BY-NC 4.0) License}.
FLaME/content/datasets/subjectiveqa.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{SubjECTive-QA (SQA)} \cite{Pardawala2024-lj} is a manually-annotated dataset focusing on subjectivity and soft misinformation in Earnings Call Transcripts (ECTs), specifically in their long-form QA sessions. It includes 49,446 annotations across 2,747 QA pairs from 120 ECTs spanning 2007 to 2021. Each QA pair is labeled on six subjectivity features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant. The dataset was benchmarked using RoBERTa-base and Llama-3-70b-Chat, showing varying performance based on feature subjectivity. Additionally, cross-domain evaluation on White House Press Briefings demonstrated its broader applicability. The SubjECTive-QA dataset is licensed under the Creative Commons Attribution 4.0 International \textbf{(CC BY 4.0) License}.
FLaME/content/datasets/tatqa.tex ADDED
@@ -0,0 +1 @@
 
 
1
+ \textbf{TAT-QA (TQA)} \cite{Zhu2021-ig} is a large-scale question-answering (QA) dataset designed for hybrid data sources, combining both tabular and textual content, particularly from financial reports. The dataset emphasizes numerical reasoning, requiring operations such as addition, subtraction, comparison, and more to infer answers from both tables and text. Extracted from real-world financial reports, TAT-QA challenges QA models to handle complex data formats, addressing a gap in existing research which often overlooks hybrid data. A new model, TAGOP, was introduced to tackle this challenge by extracting relevant cells and text spans for symbolic reasoning, achieving an F1 score of 58.0\%, though still falling short of expert human performance (90.8\%). TAT-QA provides a critical benchmark for advancing QA models in finance. The TAT-QA dataset is licensed under the Creative Commons Attribution 4.0 International \textbf{(CC BY 4.0) License.}
FLaME/content/figures/fig_methodology_domain.pdf ADDED
Binary file (125 kB).
 
FLaME/content/figures/fig_methodology_tasks.pdf ADDED
Binary file (83.2 kB).
 
FLaME/content/figures/fig_overview_flow.pdf ADDED
Binary file (63 kB).
 
FLaME/content/figures/fig_overview_tech.pdf ADDED
Binary file (165 kB).
 
FLaME/content/tables/by_task/causal_analysis.tex ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \begin{tabular}{|l|cccc|cccc|}
2
+ \toprule
3
+ Dataset & \multicolumn{4}{c|}{Causal Detection} & \multicolumn{4}{c|}{Causal Classification} \\
4
+ \midrule
5
+ Metric & Accuracy & Precision & Recall & F1 & Precision & Recall & F1 & Accuracy \\
6
+ \midrule
7
+ Llama 3 70B Instruct & 0.148 & 0.429 & 0.148 & 0.142 & 0.241 & 0.329 & 0.192 & 0.198 \\
8
+ Llama 3 8B Instruct & 0.097 & 0.341 & 0.097 & 0.049 & 0.232 & 0.241 & 0.234 & \cellcolor{green!50}{0.380} \\
9
+ DBRX Instruct & 0.078 & 0.521 & 0.078 & 0.087 & 0.276 & 0.313 & 0.231 & 0.235 \\
10
+ DeepSeek LLM (67B) & 0.026 & 0.214 & 0.026 & 0.025 & 0.141 & 0.328 & 0.193 & 0.221 \\
11
+ Gemma 2 27B & 0.115 & 0.510 & 0.115 & 0.133 & 0.309 & 0.310 & 0.242 & 0.262 \\
12
+ Gemma 2 9B & 0.115 & 0.394 & 0.115 & 0.105 & 0.275 & 0.294 & 0.207 & 0.258 \\
13
+ Mistral (7B) Instruct v0.3 & 0.078 & 0.455 & 0.078 & 0.052 & 0.339 & \cellcolor{Green!70}{0.361} & 0.227 & 0.258 \\
14
+ Mixtral-8x22B Instruct & 0.131 & 0.486 & 0.131 & 0.125 & 0.344 & 0.310 & \cellcolor{Green!70}{0.308} & \cellcolor{green!20}{0.318} \\
15
+ Mixtral-8x7B Instruct & 0.088 & 0.510 & 0.088 & 0.055 & 0.308 & 0.314 & 0.229 & 0.273 \\
16
+ Qwen 2 Instruct (72B) & 0.139 & 0.489 & 0.139 & 0.190 & 0.208 & 0.330 & 0.184 & 0.188 \\
17
+ WizardLM-2 8x22B & 0.076 & 0.453 & 0.076 & 0.114 & 0.263 & 0.347 & 0.201 & 0.237 \\
18
+ DeepSeek-V3 & 0.164 & 0.528 & 0.164 & \cellcolor{green!20}{0.198} & 0.194 & 0.327 & 0.170 & 0.248 \\
19
+ DeepSeek R1 & \cellcolor{Green!70}{0.245} & \cellcolor{green!50}{0.643} & \cellcolor{Green!70}{0.245} & \cellcolor{Green!70}{0.337} & \cellcolor{Green!70}{0.385} & 0.318 & 0.202 & 0.221 \\
20
+ QwQ-32B-Preview & 0.110 & 0.473 & 0.110 & 0.131 & 0.193 & 0.262 & 0.220 & \cellcolor{Green!70}{0.465} \\
21
+ Jamba 1.5 Mini & 0.050 & 0.280 & 0.050 & 0.043 & 0.323 & 0.283 & \cellcolor{green!50}{0.270} & 0.295 \\
22
+ Jamba 1.5 Large & 0.076 & 0.517 & 0.076 & 0.074 & 0.268 & 0.248 & 0.176 & 0.200 \\
23
+ Claude 3.5 Sonnet & 0.154 & 0.564 & 0.154 & 0.196 & 0.259 & 0.336 & 0.197 & 0.235 \\
24
+ Claude 3 Haiku & 0.082 & 0.388 & 0.082 & 0.081 & \cellcolor{green!20}{0.369} & 0.347 & 0.200 & 0.203 \\
25
+ Cohere Command R 7B & 0.089 & 0.363 & 0.089 & 0.057 & \cellcolor{green!50}{0.379} & \cellcolor{green!20}{0.356} & \cellcolor{green!20}{0.255} & 0.275 \\
26
+ Cohere Command R + & 0.090 & 0.453 & 0.090 & 0.080 & 0.353 & 0.336 & 0.238 & 0.265 \\
27
+ Google Gemini 1.5 Pro & \cellcolor{green!20}{0.165} & 0.514 & \cellcolor{green!20}{0.165} & 0.196 & 0.265 & \cellcolor{green!50}{0.357} & 0.217 & 0.258 \\
28
+ OpenAI gpt-4o & 0.082 & \cellcolor{green!20}{0.576} & 0.082 & 0.130 & 0.254 & 0.327 & 0.222 & 0.235 \\
29
+ OpenAI o1-mini & \cellcolor{green!50}{0.206} & \cellcolor{Green!70}{0.648} & \cellcolor{green!50}{0.206} & \cellcolor{green!50}{0.289} & 0.325 & 0.316 & 0.209 & 0.233 \\
30
+ \bottomrule
31
+ \end{tabular}
FLaME/content/tables/by_task/information_retrieval.tex ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \begin{tabular}{|l|cccc|cccc|cccc|cccc|cccc|}
2
+ \toprule
3
+ Dataset & \multicolumn{4}{c|}{FiNER} & \multicolumn{4}{c|}{FinRED} & \multicolumn{4}{c|}{REFinD} & \multicolumn{4}{c|}{FNXL} & \multicolumn{4}{c|}{FinEntity} \\
4
+ \midrule
5
+ Metric & Precision & Recall & F1 & Accuracy & Accuracy & Precision & Recall & F1 & Accuracy & Precision & Recall & F1 & Precision & Recall & F1 & Accuracy & Precision & Recall & Accuracy & F1 \\
6
+ \midrule
7
+ Llama 3 70B Instruct & 0.715 & 0.693 & 0.701 & 0.911 & 0.314 & \cellcolor{green!20}{0.454} & 0.314 & 0.332 & 0.879 & 0.904 & 0.879 & 0.883 & 0.015 & 0.030 & 0.020 & 0.010 & 0.474 & 0.485 & 0.485 & 0.469 \\
8
+ Llama 3 8B Instruct & 0.581 & 0.558 & 0.565 & 0.854 & 0.296 & 0.357 & 0.296 & 0.289 & 0.723 & 0.755 & 0.723 & 0.705 & 0.003 & 0.004 & 0.003 & 0.002 & 0.301 & 0.478 & 0.478 & 0.350 \\
9
+ DBRX Instruct & 0.516 & 0.476 & 0.489 & 0.802 & 0.329 & 0.371 & 0.329 & 0.304 & 0.766 & 0.825 & 0.766 & 0.778 & 0.008 & 0.011 & 0.009 & 0.005 & 0.004 & 0.014 & 0.014 & 0.006 \\
10
+ DeepSeek LLM (67B) & 0.752 & 0.742 & 0.745 & 0.917 & 0.344 & 0.403 & 0.344 & 0.334 & 0.874 & 0.890 & 0.874 & 0.879 & 0.005 & 0.009 & 0.007 & 0.003 & 0.456 & 0.405 & 0.405 & 0.416 \\
11
+ Gemma 2 27B & 0.772 & 0.754 & 0.761 & \cellcolor{green!20}{0.923} & 0.352 & 0.437 & 0.352 & 0.356 & 0.897 & 0.914 & 0.897 & 0.902 & 0.005 & 0.008 & 0.006 & 0.003 & 0.320 & 0.295 & 0.295 & 0.298 \\
12
+ Gemma 2 9B & 0.665 & 0.643 & 0.651 & 0.886 & 0.336 & 0.373 & 0.336 & 0.331 & 0.885 & 0.902 & 0.885 & 0.892 & 0.004 & 0.008 & 0.005 & 0.003 & 0.348 & 0.419 & 0.419 & 0.367 \\
13
+ Mistral (7B) Instruct v0.3 & 0.540 & 0.522 & 0.526 & 0.806 & 0.278 & 0.383 & 0.278 & 0.276 & 0.767 & 0.817 & 0.767 & 0.771 & 0.004 & 0.006 & 0.004 & 0.002 & 0.337 & 0.477 & 0.477 & 0.368 \\
14
+ Mixtral-8x22B Instruct & 0.653 & 0.625 & 0.635 & 0.870 & 0.381 & 0.414 & 0.381 & 0.367 & 0.807 & 0.847 & 0.807 & 0.811 & 0.010 & 0.008 & 0.009 & 0.005 & 0.428 & 0.481 & 0.481 & 0.435 \\
15
+ Mixtral-8x7B Instruct & 0.613 & 0.591 & 0.598 & 0.875 & 0.291 & 0.376 & 0.291 & 0.282 & 0.840 & 0.863 & 0.840 & 0.845 & 0.007 & 0.012 & 0.009 & 0.005 & 0.251 & 0.324 & 0.324 & 0.267 \\
16
+ Qwen 2 Instruct (72B) & 0.766 & 0.742 & 0.748 & 0.899 & 0.365 & 0.407 & 0.365 & 0.348 & 0.850 & 0.881 & 0.850 & 0.854 & 0.010 & 0.016 & 0.012 & 0.006 & 0.468 & 0.530 & 0.530 & 0.483 \\
17
+ WizardLM-2 8x22B & 0.755 & 0.741 & 0.744 & 0.920 & 0.362 & 0.397 & 0.362 & 0.355 & 0.846 & 0.874 & 0.846 & 0.852 & 0.008 & 0.009 & 0.008 & 0.004 & 0.222 & 0.247 & 0.247 & 0.226 \\
18
+ DeepSeek-V3 & \cellcolor{green!20}{0.798} & \cellcolor{green!20}{0.787} & \cellcolor{green!20}{0.790} & \cellcolor{Green!70}{0.945} & \cellcolor{green!50}{0.450} & \cellcolor{green!50}{0.463} & \cellcolor{green!50}{0.450} & \cellcolor{green!50}{0.437} & 0.927 & \cellcolor{green!20}{0.943} & 0.927 & 0.934 & \cellcolor{green!50}{0.034} & \cellcolor{green!20}{0.067} & \cellcolor{green!20}{0.045} & \cellcolor{green!20}{0.023} & 0.563 & 0.544 & 0.544 & 0.549 \\
19
+ DeepSeek R1 & \cellcolor{Green!70}{0.813} & \cellcolor{Green!70}{0.805} & \cellcolor{Green!70}{0.807} & \cellcolor{green!50}{0.944} & \cellcolor{green!20}{0.412} & 0.424 & \cellcolor{green!20}{0.412} & 0.393 & \cellcolor{Green!70}{0.946} & \cellcolor{Green!70}{0.960} & \cellcolor{Green!70}{0.946} & \cellcolor{Green!70}{0.952} & \cellcolor{Green!70}{0.044} & \cellcolor{Green!70}{0.082} & \cellcolor{Green!70}{0.057} & \cellcolor{Green!70}{0.029} & \cellcolor{green!20}{0.600} & \cellcolor{green!20}{0.586} & \cellcolor{green!20}{0.586} & \cellcolor{green!20}{0.587} \\
20
+ QwQ-32B-Preview & 0.695 & 0.681 & 0.685 & 0.907 & 0.278 & 0.396 & 0.278 & 0.270 & 0.680 & 0.770 & 0.680 & 0.656 & 0.001 & 0.001 & 0.001 & 0.000 & 0.005 & 0.005 & 0.005 & 0.005 \\
21
+ Jamba 1.5 Mini & 0.564 & 0.556 & 0.552 & 0.818 & 0.308 & 0.450 & 0.308 & 0.284 & 0.830 & 0.864 & 0.830 & 0.844 & 0.004 & 0.006 & 0.005 & 0.003 & 0.119 & 0.182 & 0.182 & 0.132 \\
22
+ Jamba 1.5 Large & 0.707 & 0.687 & 0.693 & 0.883 & 0.341 & 0.452 & 0.341 & 0.341 & 0.856 & 0.890 & 0.856 & 0.862 & 0.004 & 0.005 & 0.005 & 0.002 & 0.403 & 0.414 & 0.414 & 0.397 \\
23
+ Claude 3.5 Sonnet & \cellcolor{green!50}{0.811} & \cellcolor{green!50}{0.794} & \cellcolor{green!50}{0.799} & 0.922 & \cellcolor{Green!70}{0.455} & \cellcolor{Green!70}{0.465} & \cellcolor{Green!70}{0.455} & \cellcolor{Green!70}{0.439} & 0.873 & 0.927 & 0.873 & 0.891 & \cellcolor{green!50}{0.034} & \cellcolor{green!50}{0.080} & \cellcolor{green!50}{0.047} & \cellcolor{green!50}{0.024} & \cellcolor{green!50}{0.658} & \cellcolor{green!50}{0.668} & \cellcolor{green!50}{0.668} & \cellcolor{green!50}{0.655} \\
24
+ Claude 3 Haiku & 0.732 & 0.700 & 0.711 & 0.895 & 0.294 & 0.330 & 0.294 & 0.285 & 0.879 & 0.917 & 0.879 & 0.883 & 0.011 & 0.022 & 0.015 & 0.008 & 0.498 & 0.517 & 0.517 & 0.494 \\
25
+ Cohere Command R + & 0.769 & 0.750 & 0.756 & 0.902 & 0.353 & 0.405 & 0.353 & 0.333 & 0.917 & 0.930 & 0.917 & 0.922 & 0.016 & 0.032 & 0.021 & 0.011 & 0.462 & 0.459 & 0.459 & 0.452 \\
26
+ Google Gemini 1.5 Pro & 0.728 & 0.705 & 0.712 & 0.891 & 0.373 & 0.436 & 0.373 & 0.374 & \cellcolor{green!50}{0.934} & \cellcolor{green!50}{0.955} & \cellcolor{green!50}{0.934} & \cellcolor{green!50}{0.944} & 0.014 & 0.028 & 0.019 & 0.010 & 0.399 & 0.400 & 0.400 & 0.393 \\
27
+ OpenAI gpt-4o & 0.778 & 0.760 & 0.766 & 0.911 & 0.402 & 0.445 & 0.402 & 0.399 & \cellcolor{green!20}{0.931} & \cellcolor{green!50}{0.955} & \cellcolor{green!20}{0.931} & \cellcolor{green!20}{0.942} & \cellcolor{green!20}{0.027} & 0.056 & 0.037 & 0.019 & 0.537 & 0.517 & 0.517 & 0.523 \\
28
+ OpenAI o1-mini & 0.772 & 0.755 & 0.761 & 0.922 & 0.407 & 0.444 & 0.407 & \cellcolor{green!20}{0.403} & 0.867 & 0.900 & 0.867 & 0.876 & 0.007 & 0.015 & 0.010 & 0.005 & \cellcolor{Green!70}{0.661} & \cellcolor{Green!70}{0.681} & \cellcolor{Green!70}{0.681} & \cellcolor{Green!70}{0.662} \\
29
+ \bottomrule
30
+ \end{tabular}