Glenn Matlin
committed on
Commit · 632b302
1 Parent(s): db2bd37
updates to index.html
- CLAUDE.md +33 -128
- FLaME/{FLaME__ACL_AAR_Feb_2025_.pdf → FLaME.pdf} +0 -0
- FLaME/FLaME.pdf:Zone.Identifier +4 -0
- FLaME/FLaME[ACLAARFeb2025]/content/0_authors.tex +1 -1
- FLaME/content/0_authors.tex +1 -1
- index.html +29 -29
CLAUDE.md
CHANGED
@@ -2,138 +2,43 @@
|
|
2 |
|
3 |
## Project Overview
|
4 |
- FLaME: Holistic Financial Language Model Evaluation
|
5 |
-
-
|
6 |
-
-
|
7 |
-
- Research paper for ACL Annual Advances in Research (Feb 2025)
|
8 |
- Hosted on HuggingFace Spaces
|
9 |
|
10 |
-
##
|
11 |
-
-
|
12 |
-
-
|
|
|
13 |
|
14 |
## Code Style Guidelines
|
15 |
-
- HTML:
|
16 |
-
- CSS: Follow Bulma
|
17 |
- JavaScript:
|
18 |
-
- Use camelCase for variables
|
19 |
-
-
|
20 |
-
- Include semicolons
|
21 |
-
-
|
22 |
-
-
|
23 |
-
|
24 |
-
|
25 |
-
|
26 |
-
|
27 |
-
|
28 |
-
|
29 |
-
-
|
30 |
-
-
|
31 |
-
|
32 |
-
|
33 |
-
-
|
34 |
-
|
35 |
-
|
36 |
-
-
|
37 |
-
-
|
38 |
-
-
|
39 |
-
|
40 |
-
|
41 |
-
|
42 |
-
|
43 |
-
- Keep all CSS in static/css/
|
44 |
-
- Keep all JavaScript in static/js/
|
45 |
-
- Keep media files in appropriate subdirectories
|
46 |
-
- Paper content in FLaME/content/
|
47 |
-
- Use section IDs for navigation linking (e.g., #abstract, #methodology)
|
48 |
-
|
49 |
-
## Interactive Components
|
50 |
-
- Navbar with smooth scrolling to sections
|
51 |
-
- Performance indicator bars for result visualization
|
52 |
-
- Card layouts for key findings with hover effects
|
53 |
-
- Methodology workflow diagram with step visualization
|
54 |
-
- Interactive feature highlights with icons
|
55 |
-
- Getting started guide with numbered steps
|
56 |
-
|
57 |
-
## Responsive Design
|
58 |
-
- Mobile-friendly navigation menu (hamburger on small screens)
|
59 |
-
- Stacked cards on mobile devices
|
60 |
-
- Adjusted typography and spacing for different screen sizes
|
61 |
-
- Media queries for breakpoints at 768px
|
62 |
-
|
63 |
-
## FLaME Research Paper Information
|
64 |
-
|
65 |
-
### Authors
|
66 |
-
- Oopy Goopy, General Munchkin Man, L'il Jim Bob, Larry
|
67 |
-
- Affiliation: Georgia Institute of Technology
|
68 |
-
|
69 |
-
### Paper Focus and Objective
|
70 |
-
- First comprehensive benchmarking framework for evaluating language models on financial NLP tasks
|
71 |
-
- Addresses gaps in existing evaluation methodologies for financial language models
|
72 |
-
- Provides standardized evaluation framework with open-source implementation
|
73 |
-
|
74 |
-
### Key Components
|
75 |
-
|
76 |
-
#### Taxonomy
|
77 |
-
- Organized by three dimensions: tasks, domains, and languages
|
78 |
-
- Six core FinNLP tasks:
|
79 |
-
1. Text classification
|
80 |
-
2. Sentiment analysis
|
81 |
-
3. Information retrieval
|
82 |
-
4. Causal analysis
|
83 |
-
5. Text summarization
|
84 |
-
6. Question answering
|
85 |
-
- Domains categorized by data source, origination, time period, etc.
|
86 |
-
- Currently focuses on English language
|
87 |
-
|
88 |
-
#### Datasets
|
89 |
-
Selected based on:
|
90 |
-
- Financial domain relevance
|
91 |
-
- Fair usage licensing
|
92 |
-
- Annotation quality
|
93 |
-
- Task substance
|
94 |
-
|
95 |
-
Key datasets include:
|
96 |
-
- Banking: Banking77, FiQA, FinRED
|
97 |
-
- Investment: FPB, Headlines, SubjectiveQA
|
98 |
-
- Accounting: FinQA, TaT-QA, ConvFinQA
|
99 |
-
- Corporate: ECTSum, EDTSum, FinCausal
|
100 |
-
- Monetary Policy: FOMC, FNXL
|
101 |
-
- Cross-domain: FinBench, NumClaim, ReFINED
|
102 |
-
|
103 |
-
#### Models Evaluated
|
104 |
-
- Proprietary closed-source: GPT-4o & o1-mini, Gemini-1.5, Claude3, Cohere Command R
|
105 |
-
- Open-weight: Llama-3, DeepSeekV3 & R-1, Qwen-2 & QwQ, Mistral, Gemma-1 & 2, Mixtral, WizardLM2, DBRX
|
106 |
-
- Used deterministic decoding (temperature 0.0, top p of 0.9, repetition penalty of 1)
|
107 |
-
|
108 |
-
#### Evaluation Process
|
109 |
-
- Two-stage approach: generation and extraction
|
110 |
-
- Task-specific metrics: accuracy, F1 scores, precision, recall, BLEU scores
|
111 |
-
- Standardized zero-shot evaluation
|
112 |
-
|
113 |
-
### Key Findings
|
114 |
-
- No single model performs best across all tasks
|
115 |
-
- Performance varies significantly based on domain and task structure
|
116 |
-
- Open-weight models show strong cost/performance efficiency
|
117 |
-
- Numeric reasoning tasks remain challenging for all models
|
118 |
-
- Inconsistent scaling: larger parameter sizes don't guarantee higher performance
|
119 |
-
- Models struggle with consistent numeric formats and longer label sets
|
120 |
-
- Top performers: DeepSeek R1, OpenAI o1-mini, Claude 3.5 Sonnet
|
121 |
-
|
122 |
-
### Limitations
|
123 |
-
- Limited dataset size and diversity
|
124 |
-
- Focus on zero-shot scenarios only
|
125 |
-
- English-language focus
|
126 |
-
- No evaluation of advanced prompting techniques
|
127 |
-
- Doesn't capture full breadth of real-world financial scenarios
|
128 |
-
|
129 |
-
### Future Directions
|
130 |
-
- More advanced prompt engineering
|
131 |
-
- Domain-adaptive training for numeric/causal tasks
|
132 |
-
- Benchmarking efficiency trade-offs
|
133 |
-
- Multi-lingual coverage expansion
|
134 |
-
|
135 |
-
### Resources
|
136 |
-
- Paper PDF: FLaME/FLaME__ACL_AAR_Feb_2025_.pdf
|
137 |
-
- ArXiv: https://arxiv.org/abs/2402.14017
|
138 |
- GitHub: https://github.com/flame-benchmark/flame
|
139 |
- HuggingFace: https://huggingface.co/spaces/flame-benchmark/flame
|
|
|
2 |
|
3 |
## Project Overview
|
4 |
- FLaME: Holistic Financial Language Model Evaluation
|
5 |
+
- LaTeX paper with static website built using Bulma CSS
|
6 |
+
- Research project for ACL Annual Advances in Research (Feb 2025)
|
|
|
7 |
- Hosted on HuggingFace Spaces
|
8 |
|
9 |
+
## Commands
|
10 |
+
- Build LaTeX paper: `pdflatex FLaME.tex && bibtex FLaME && pdflatex FLaME.tex && pdflatex FLaME.tex`
|
11 |
+
- Local website testing: `python -m http.server 8000`
|
12 |
+
- Fix tooltips: `./fix_tooltips.sh`
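
The build command above chains four steps: a first `pdflatex` pass, a `bibtex` pass to resolve citations, then two more `pdflatex` passes to settle references. A minimal Python sketch of the same sequence follows. The helper names (`latex_build_steps`, `as_shell_line`, `run_build`) are hypothetical and not part of this repository, and the sketch assumes `pdflatex` and `bibtex` are on `PATH` and that it runs from the directory containing `FLaME.tex`.

```python
import shlex
import subprocess


def latex_build_steps(main="FLaME"):
    """Return the pdflatex -> bibtex -> pdflatex -> pdflatex sequence as argv lists."""
    tex = f"{main}.tex"
    return [
        ["pdflatex", tex],  # first pass writes the .aux file
        ["bibtex", main],   # resolves citations from the .aux file
        ["pdflatex", tex],  # pulls the bibliography in
        ["pdflatex", tex],  # settles cross-references
    ]


def as_shell_line(steps):
    """Join the steps with && so the result matches the one-line shell form."""
    return " && ".join(shlex.join(step) for step in steps)


def run_build(steps, cwd="."):
    """Run each step in order and stop at the first failure (like && does)."""
    for step in steps:
        subprocess.run(step, cwd=cwd, check=True)
```

Calling `as_shell_line(latex_build_steps())` reproduces the documented one-liner, which makes it easy to check that the script and CLAUDE.md have not drifted apart.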
|
13 |
|
14 |
## Code Style Guidelines
|
15 |
+
- HTML: Semantic HTML5, Bulma framework conventions
|
16 |
+
- CSS: Follow Bulma conventions, keep in static/css/
|
17 |
- JavaScript:
|
18 |
+
- Use camelCase for variables/functions
|
19 |
+
- 2-space indentation
|
20 |
+
- Include semicolons
|
21 |
+
- Prefer vanilla JS
|
22 |
+
- Store in static/js/
|
23 |
+
- LaTeX: Follow ACL style guidelines in acl_formatting.md
|
24 |
+
|
25 |
+
## Paper Structure
|
26 |
+
- Main source in FLaME.tex
|
27 |
+
- Content modularized in FLaME/content/ directory
|
28 |
+
- Six core task areas: text classification, sentiment analysis, info retrieval, causal analysis, summarization, QA
|
29 |
+
- Datasets in FLaME/content/datasets/
|
30 |
+
- Results in FLaME/content/tables/
|
31 |
+
|
32 |
+
## Website Design
|
33 |
+
- Color scheme: Deep blue (#004d99), Orange (#ff6b00), Light bg (#f8f9fa)
|
34 |
+
- Card-based responsive layout
|
35 |
+
- Interactive elements with tooltips
|
36 |
+
- Results displayed in task-specific tables
|
37 |
+
- Media files optimized for web delivery
|
38 |
+
- Convert PDF figures to JPG/PNG for web display
|
39 |
+
|
40 |
+
## Deployment
|
41 |
+
- Website deploys to HuggingFace Spaces
|
42 |
+
- Paper available as PDF at FLaME/FLaME.pdf
|
43 |
- GitHub: https://github.com/flame-benchmark/flame
|
44 |
- HuggingFace: https://huggingface.co/spaces/flame-benchmark/flame
|
FLaME/{FLaME__ACL_AAR_Feb_2025_.pdf → FLaME.pdf}
RENAMED
Binary files a/FLaME/FLaME__ACL_AAR_Feb_2025_.pdf and b/FLaME/FLaME.pdf differ
|
|
FLaME/FLaME.pdf:Zone.Identifier
ADDED
@@ -0,0 +1,4 @@
|
1 |
+
[ZoneTransfer]
|
2 |
+
ZoneId=3
|
3 |
+
ReferrerUrl=https://www.overleaf.com/project/67380d90965fe3bdf7157e6c
|
4 |
+
HostUrl=https://www.overleaf.com/download/project/67380d90965fe3bdf7157e6c/build/195acbb99c5-1bdebd47a7495259/output/output.pdf?compileGroup=priority&clsiserverid=clsi-pre-emp-c2d-c-f-bzgb&enable_pdf_caching=true&popupDownload=true
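
The added `FLaME.pdf:Zone.Identifier` file looks like Windows download metadata (a `[ZoneTransfer]` record with `ZoneId=3`, the Internet zone) that was committed alongside the PDF, probably unintentionally. A small Python sketch for spotting such files before committing is shown below. The function name `find_zone_identifier_files` is hypothetical, and the sketch assumes a filesystem where the metadata shows up as a regular file with a `:Zone.Identifier` suffix, as it does in this commit.

```python
from pathlib import Path


def find_zone_identifier_files(root):
    """Return regular files under root whose name ends in ':Zone.Identifier'."""
    return sorted(
        p for p in Path(root).rglob("*:Zone.Identifier") if p.is_file()
    )
```

The result could be reviewed by hand, or a matching ignore pattern could be added to `.gitignore`, depending on how the repository is maintained.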
|
FLaME/FLaME[ACLAARFeb2025]/content/0_authors.tex
CHANGED
@@ -1 +1 @@
|
|
1 |
-
\author{
|
|
|
1 |
+
\author{Glenn Matlin, {\bf Mika Okamoto}, {\bf Huzaifa Pardwala}, {\bf Yang Yang}, {\bf Sudheer Chava} \\ Georgia Institute of Technology}
|
FLaME/content/0_authors.tex
CHANGED
@@ -1 +1 @@
|
|
1 |
-
\author{
|
|
|
1 |
+
\author{Glenn Matlin, {\bf Mika Okamoto}, {\bf Huzaifa Pardwala}, {\bf Yang Yang}, {\bf Sudheer Chava} \\ Georgia Institute of Technology}
|
index.html
CHANGED
@@ -93,14 +93,15 @@
|
|
93 |
<h1 class="title is-1 publication-title">FLaME: Holistic Financial Language Model Evaluation</h1>
|
94 |
<div class="is-size-5 publication-authors">
|
95 |
<span class="author-block">
|
96 |
-
<a href="#" target="_blank">
|
97 |
<span class="author-block">
|
98 |
-
<a href="#" target="_blank">
|
99 |
<span class="author-block">
|
100 |
-
<a href="#" target="_blank">
|
101 |
-
|
|
|
102 |
<span class="author-block">
|
103 |
-
<a href="#" target="_blank">
|
104 |
</span>
|
105 |
</div>
|
106 |
|
@@ -112,7 +113,7 @@
|
|
112 |
<div class="publication-links">
|
113 |
<!-- PDF Link. -->
|
114 |
<span class="link-block">
|
115 |
-
<a href="FLaME/
|
116 |
class="external-link button is-normal is-rounded is-dark">
|
117 |
<span class="icon">
|
118 |
<i class="fas fa-file-pdf"></i>
|
@@ -3911,15 +3912,15 @@
|
|
3911 |
|
3912 |
<hr>
|
3913 |
|
3914 |
-
<p class="has-text-weight-bold mb-3"
|
3915 |
|
3916 |
<div class="notification is-info is-light py-3 px-4">
|
3917 |
-
<p><strong
|
3918 |
-
<p><strong
|
3919 |
-
<p><strong
|
3920 |
-
<p><strong
|
3921 |
-
<p><strong
|
3922 |
-
<p><strong
|
3923 |
</div>
|
3924 |
</div>
|
3925 |
</div>
|
@@ -4200,27 +4201,27 @@
|
|
4200 |
</div>
|
4201 |
|
4202 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4203 |
-
<p class="has-text-weight-bold mb-1"
|
4204 |
<p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
|
4205 |
</div>
|
4206 |
|
4207 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4208 |
-
<p class="has-text-weight-bold mb-1"
|
4209 |
<p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.</p>
|
4210 |
</div>
|
4211 |
|
4212 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4213 |
-
<p class="has-text-weight-bold mb-1"
|
4214 |
<p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
|
4215 |
</div>
|
4216 |
|
4217 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4218 |
-
<p class="has-text-weight-bold mb-1"
|
4219 |
<p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
|
4220 |
</div>
|
4221 |
|
4222 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4223 |
-
<p class="has-text-weight-bold mb-1"
|
4224 |
<p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
|
4225 |
</div>
|
4226 |
|
@@ -4299,7 +4300,7 @@
|
|
4299 |
|
4300 |
<div class="feature-item mb-3">
|
4301 |
<p class="has-text-weight-bold mb-1">
|
4302 |
-
<span class="icon has-text-primary"><i class="fas fa-check"></i></span>
|
4303 |
</p>
|
4304 |
<p class="is-size-7 ml-4">Ensures consistent evaluation metrics and transparent methodology.</p>
|
4305 |
</div>
|
@@ -4391,7 +4392,7 @@
|
|
4391 |
<div class="column is-6">
|
4392 |
<div class="dataset-category box">
|
4393 |
<p class="has-text-weight-bold">
|
4394 |
-
|
4395 |
</p>
|
4396 |
<ul>
|
4397 |
<li><strong>FinQA</strong> – Multi-step financial numerical reasoning.</li>
|
@@ -4405,7 +4406,7 @@
|
|
4405 |
<div class="column is-6">
|
4406 |
<div class="dataset-category box">
|
4407 |
<p class="has-text-weight-bold">
|
4408 |
-
|
4409 |
</p>
|
4410 |
<ul>
|
4411 |
<li><strong>ECTSum</strong> – Earnings call transcript summarization.</li>
|
@@ -4418,7 +4419,7 @@
|
|
4418 |
<div class="column is-6">
|
4419 |
<div class="dataset-category box">
|
4420 |
<p class="has-text-weight-bold">
|
4421 |
-
|
4422 |
</p>
|
4423 |
<ul>
|
4424 |
<li><strong>FiNER-ORD</strong> – Named entity recognition for financial documents.</li>
|
@@ -4434,7 +4435,7 @@
|
|
4434 |
<div class="column is-6">
|
4435 |
<div class="dataset-category box">
|
4436 |
<p class="has-text-weight-bold">
|
4437 |
-
|
4438 |
</p>
|
4439 |
<ul>
|
4440 |
<li><strong>FiQA (Task 1)</strong> – Aspect-based sentiment analysis.</li>
|
@@ -4449,7 +4450,7 @@
|
|
4449 |
<div class="column is-6">
|
4450 |
<div class="dataset-category box">
|
4451 |
<p class="has-text-weight-bold">
|
4452 |
-
|
4453 |
</p>
|
4454 |
<ul>
|
4455 |
<li><strong>Numerical Claim Detection</strong> – Fine-grained investor claim detection.</li>
|
@@ -4461,11 +4462,11 @@
|
|
4461 |
</div>
|
4462 |
</div>
|
4463 |
|
4464 |
-
<!--
|
4465 |
<div class="column is-6">
|
4466 |
<div class="dataset-category box">
|
4467 |
<p class="has-text-weight-bold">
|
4468 |
-
|
4469 |
</p>
|
4470 |
<ul>
|
4471 |
<li><strong>FinCausal</strong> – Causal reasoning in financial news.</li>
|
@@ -4493,7 +4494,6 @@
|
|
4493 |
<pre><code>@article{flame2025,
|
4494 |
author = {Goopy, Oopy and Man, General Munchkin and Bob, L'il Jim and Larry},
|
4495 |
title = {FLaME: Holistic Financial Language Model Evaluation},
|
4496 |
-
journal = {ACL Annual Advances in Research},
|
4497 |
year = {2025},
|
4498 |
month = {February},
|
4499 |
}</code></pre>
|
@@ -4510,7 +4510,7 @@
|
|
4510 |
<h4 class="has-text-white mb-4"><span class="flame">FLaME</span>: Financial Language Model Evaluation</h4>
|
4511 |
|
4512 |
<div class="footer-links mb-5">
|
4513 |
-
<a class="icon-link mr-3" target="_blank" href="FLaME/
|
4514 |
<i class="fas fa-file-pdf fa-lg"></i>
|
4515 |
</a>
|
4516 |
<a class="icon-link mr-3" href="https://arxiv.org/abs/2402.14017" target="_blank" title="View on arXiv">
|
@@ -4525,7 +4525,7 @@
|
|
4525 |
</div>
|
4526 |
|
4527 |
<div class="institution-info mb-4">
|
4528 |
-
<p class="has-text-white-ter">Georgia Institute of Technology
|
4529 |
</div>
|
4530 |
|
4531 |
<p class="has-text-white-ter is-size-7">
|
|
|
93 |
<h1 class="title is-1 publication-title">FLaME: Holistic Financial Language Model Evaluation</h1>
|
94 |
<div class="is-size-5 publication-authors">
|
95 |
<span class="author-block">
|
96 |
+
<a href="#" target="_blank">Glenn Matlin</a><sup>1</sup>,</span>
|
97 |
<span class="author-block">
|
98 |
+
<a href="#" target="_blank">Mika Okamoto</a><sup>1</sup>,</span>
|
99 |
<span class="author-block">
|
100 |
+
<a href="#" target="_blank">Huzaifa Pardwala</a><sup>1</sup>,</span>
|
101 |
+
<span class="author-block">
|
102 |
+
<a href="#" target="_blank">Yang Yang</a><sup>1</sup>,</span>
|
103 |
<span class="author-block">
|
104 |
+
<a href="#" target="_blank">Sudheer Chava</a><sup>1</sup>
|
105 |
</span>
|
106 |
</div>
|
107 |
|
|
|
113 |
<div class="publication-links">
|
114 |
<!-- PDF Link. -->
|
115 |
<span class="link-block">
|
116 |
+
<a href="FLaME/FLaME.pdf" target="_blank"
|
117 |
class="external-link button is-normal is-rounded is-dark">
|
118 |
<span class="icon">
|
119 |
<i class="fas fa-file-pdf"></i>
|
|
|
3912 |
|
3913 |
<hr>
|
3914 |
|
3915 |
+
<p class="has-text-weight-bold mb-3"><span class="icon has-text-primary"><i class="fa-solid fa-magnifying-glass"></i></span> Key Insights from Model Analysis</p>
|
3916 |
|
3917 |
<div class="notification is-info is-light py-3 px-4">
|
3918 |
+
<p><strong><span class="icon"><i class="fa-solid fa-trophy"></i></span> No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
|
3919 |
+
<p><strong><span class="icon"><i class="fa-solid fa-balance-scale"></i></span> Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
|
3920 |
+
<p><strong><span class="icon"><i class="fa-solid fa-tools"></i></span> Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
|
3921 |
+
<p><strong><span class="icon"><i class="fa-solid fa-coins"></i></span> Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
|
3922 |
+
<p><strong><span class="icon"><i class="fa-solid fa-chart-line"></i></span> Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
|
3923 |
+
<p><strong><span class="icon"><i class="fa-solid fa-list-ol"></i></span> Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
|
3924 |
</div>
|
3925 |
</div>
|
3926 |
</div>
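
The cost figures in the "Cost-performance disparities" bullet can be turned into ratios with simple arithmetic. The Python sketch below uses only the numbers quoted in that bullet (treating "up to" and "around" as point values, which is a simplification). The names `COST_PER_MILLION_TOKENS`, `cost_ratio`, and `estimated_cost` are illustrative and do not come from the FLaME code.

```python
# USD per million tokens, as quoted in the bullet above.
COST_PER_MILLION_TOKENS = {
    "DeepSeek R1": 260.0,        # quoted as "up to $260"
    "Claude 3.5 Sonnet": 105.0,  # quoted as "around $105"
    "o1-mini": 105.0,            # quoted as "around $105"
    "Llama 3.1 8B": 4.0,         # quoted as "only $4"
}


def cost_ratio(model, baseline="Llama 3.1 8B", costs=COST_PER_MILLION_TOKENS):
    """How many times more expensive `model` is than `baseline` per token."""
    return costs[model] / costs[baseline]


def estimated_cost(model, n_tokens, costs=COST_PER_MILLION_TOKENS):
    """Estimated USD cost of processing n_tokens at the quoted rate."""
    return costs[model] * n_tokens / 1_000_000
```

On these figures, DeepSeek R1 comes out at 65 times the per-token cost of Llama 3.1 8B, and Claude 3.5 Sonnet or o1-mini at 26.25 times.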
|
|
|
4201 |
</div>
|
4202 |
|
4203 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4204 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-brain"></i></span> Few-Shot & Chain-of-Thought</p>
|
4205 |
<p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
|
4206 |
</div>
|
4207 |
|
4208 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4209 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-line"></i></span> Domain-Adaptive Training</p>
|
4210 |
<p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.</p>
|
4211 |
</div>
|
4212 |
|
4213 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4214 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-database"></i></span> Expanded Dataset Coverage</p>
|
4215 |
<p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
|
4216 |
</div>
|
4217 |
|
4218 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4219 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-balance-scale"></i></span> Efficiency & Cost Benchmarking</p>
|
4220 |
<p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
|
4221 |
</div>
|
4222 |
|
4223 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
4224 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-bar"></i></span> Advanced Evaluation Metrics</p>
|
4225 |
<p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
|
4226 |
</div>
|
4227 |
|
|
|
4300 |
|
4301 |
<div class="feature-item mb-3">
|
4302 |
<p class="has-text-weight-bold mb-1">
|
4303 |
+
<span class="icon has-text-primary"><i class="fas fa-check"></i></span> Reproducible Benchmarking
|
4304 |
</p>
|
4305 |
<p class="is-size-7 ml-4">Ensures consistent evaluation metrics and transparent methodology.</p>
|
4306 |
</div>
|
|
|
4392 |
<div class="column is-6">
|
4393 |
<div class="dataset-category box">
|
4394 |
<p class="has-text-weight-bold">
|
4395 |
+
📊 Numerical Reasoning & Question Answering
|
4396 |
</p>
|
4397 |
<ul>
|
4398 |
<li><strong>FinQA</strong> – Multi-step financial numerical reasoning.</li>
|
|
|
4406 |
<div class="column is-6">
|
4407 |
<div class="dataset-category box">
|
4408 |
<p class="has-text-weight-bold">
|
4409 |
+
📝 Text Summarization
|
4410 |
</p>
|
4411 |
<ul>
|
4412 |
<li><strong>ECTSum</strong> – Earnings call transcript summarization.</li>
|
|
|
4419 |
<div class="column is-6">
|
4420 |
<div class="dataset-category box">
|
4421 |
<p class="has-text-weight-bold">
|
4422 |
+
🔎 Information Retrieval
|
4423 |
</p>
|
4424 |
<ul>
|
4425 |
<li><strong>FiNER-ORD</strong> – Named entity recognition for financial documents.</li>
|
|
|
4435 |
<div class="column is-6">
|
4436 |
<div class="dataset-category box">
|
4437 |
<p class="has-text-weight-bold">
|
4438 |
+
😐 Sentiment Analysis
|
4439 |
</p>
|
4440 |
<ul>
|
4441 |
<li><strong>FiQA (Task 1)</strong> – Aspect-based sentiment analysis.</li>
|
|
|
4450 |
<div class="column is-6">
|
4451 |
<div class="dataset-category box">
|
4452 |
<p class="has-text-weight-bold">
|
4453 |
+
🏷️ Text Classification
|
4454 |
</p>
|
4455 |
<ul>
|
4456 |
<li><strong>Numerical Claim Detection</strong> – Fine-grained investor claim detection.</li>
|
|
|
4462 |
</div>
|
4463 |
</div>
|
4464 |
|
4465 |
+
<!-- 🧠 Causal Analysis -->
|
4466 |
<div class="column is-6">
|
4467 |
<div class="dataset-category box">
|
4468 |
<p class="has-text-weight-bold">
|
4469 |
+
🧠 Causal Analysis
|
4470 |
</p>
|
4471 |
<ul>
|
4472 |
<li><strong>FinCausal</strong> – Causal reasoning in financial news.</li>
|
|
|
4494 |
<pre><code>@article{flame2025,
|
4495 |
author = {Goopy, Oopy and Man, General Munchkin and Bob, L'il Jim and Larry},
|
4496 |
title = {FLaME: Holistic Financial Language Model Evaluation},
|
|
|
4497 |
year = {2025},
|
4498 |
month = {February},
|
4499 |
}</code></pre>
|
|
|
4510 |
<h4 class="has-text-white mb-4"><span class="flame">FLaME</span>: Financial Language Model Evaluation</h4>
|
4511 |
|
4512 |
<div class="footer-links mb-5">
|
4513 |
+
<a class="icon-link mr-3" target="_blank" href="FLaME/FLaME.pdf" title="Download PDF">
|
4514 |
<i class="fas fa-file-pdf fa-lg"></i>
|
4515 |
</a>
|
4516 |
<a class="icon-link mr-3" href="https://arxiv.org/abs/2402.14017" target="_blank" title="View on arXiv">
|
|
|
4525 |
</div>
|
4526 |
|
4527 |
<div class="institution-info mb-4">
|
4528 |
+
<p class="has-text-white-ter">Georgia Institute of Technology</p>
|
4529 |
</div>
|
4530 |
|
4531 |
<p class="has-text-white-ter is-size-7">
|