Glenn Matlin committed on
Commit 632b302 · 1 Parent(s): db2bd37

updates to index.html

CLAUDE.md CHANGED
@@ -2,138 +2,43 @@
2
 
3
  ## Project Overview
4
  - FLaME: Holistic Financial Language Model Evaluation
5
- - Static website built with Bulma CSS framework
6
- - No build system (pure HTML/CSS/JavaScript)
7
- - Research paper for ACL Annual Advances in Research (Feb 2025)
8
  - Hosted on HuggingFace Spaces
9
 
10
- ## Serving the Website
11
- - For local testing: `python -m http.server 8000` (will serve from current directory)
12
- - Open browser to http://localhost:8000
 
13
 
14
  ## Code Style Guidelines
15
- - HTML: Follow semantic HTML5 practices
16
- - CSS: Follow Bulma framework conventions
17
  - JavaScript:
18
- - Use camelCase for variables and functions
19
- - Use 2-space indentation
20
- - Include semicolons after statements
21
- - Use function declarations with 'function' keyword
22
- - Prefer vanilla JS with minimal dependencies
23
-
24
- ## Website Design Guidelines
25
- - Color scheme:
26
- - Primary: Deep blue (#004d99) for finance theme
27
- - Secondary: Orange (#ff6b00) for "flame" accent
28
- - Light background (#f8f9fa) for content areas
29
- - Card-based layout for content organization
30
- - Interactive elements with hover effects
31
- - Mobile-responsive design
32
- - Navigation menu with fixed positioning
33
- - Footer with institutional information and resource links
34
-
35
- ## Media Guidelines
36
- - Image formats: Prefer .jpg for photos, .svg for vector graphics
37
- - Video formats: Use .mp4 for compatibility
38
- - Optimize media files for web delivery
39
- - Paper figures are in PDF format in FLaME/content/figures/
40
- - For web display, convert PDF figures to PNG/JPG formats
41
-
42
- ## Structure
43
- - Keep all CSS in static/css/
44
- - Keep all JavaScript in static/js/
45
- - Keep media files in appropriate subdirectories
46
- - Paper content in FLaME/content/
47
- - Use section IDs for navigation linking (e.g., #abstract, #methodology)
48
-
49
- ## Interactive Components
50
- - Navbar with smooth scrolling to sections
51
- - Performance indicator bars for result visualization
52
- - Card layouts for key findings with hover effects
53
- - Methodology workflow diagram with step visualization
54
- - Interactive feature highlights with icons
55
- - Getting started guide with numbered steps
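A minimal vanilla-JS sketch of the smooth-scrolling navigation described in the list above (illustrative only, not part of CLAUDE.md; the selector and attribute names are assumptions rather than the actual contents of static/js/):

```javascript
// Sketch: smooth scrolling from navbar links to section IDs such as
// #abstract or #methodology. Selectors are assumed, not copied from
// the project's real static/js/ files.
document.addEventListener('DOMContentLoaded', function () {
  var links = document.querySelectorAll('a.navbar-item[href^="#"]');
  links.forEach(function (link) {
    link.addEventListener('click', function (event) {
      var target = document.querySelector(link.getAttribute('href'));
      if (target) {
        event.preventDefault();
        target.scrollIntoView({ behavior: 'smooth' });
      }
    });
  });
});
```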
56
-
57
- ## Responsive Design
58
- - Mobile-friendly navigation menu (hamburger on small screens)
59
- - Stacked cards on mobile devices
60
- - Adjusted typography and spacing for different screen sizes
61
- - Media queries for breakpoints at 768px
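A hedged sketch of the mobile hamburger menu mentioned above, following the standard Bulma navbar pattern (the class names and data-target attribute assume default Bulma markup, not this site's actual HTML):

```javascript
// Sketch: toggle the Bulma navbar menu on small screens. Assumes the
// default Bulma markup (.navbar-burger with a data-target pointing at
// the id of the .navbar-menu element).
document.addEventListener('DOMContentLoaded', function () {
  var burgers = document.querySelectorAll('.navbar-burger');
  burgers.forEach(function (burger) {
    burger.addEventListener('click', function () {
      var menu = document.getElementById(burger.dataset.target);
      burger.classList.toggle('is-active');
      if (menu) {
        menu.classList.toggle('is-active');
      }
    });
  });
});
```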
62
-
63
- ## FLaME Research Paper Information
64
-
65
- ### Authors
66
- - Oopy Goopy, General Munchkin Man, L'il Jim Bob, Larry
67
- - Affiliation: Georgia Institute of Technology
68
-
69
- ### Paper Focus and Objective
70
- - First comprehensive benchmarking framework for evaluating language models on financial NLP tasks
71
- - Addresses gaps in existing evaluation methodologies for financial language models
72
- - Provides standardized evaluation framework with open-source implementation
73
-
74
- ### Key Components
75
-
76
- #### Taxonomy
77
- - Organized by three dimensions: tasks, domains, and languages
78
- - Six core FinNLP tasks:
79
- 1. Text classification
80
- 2. Sentiment analysis
81
- 3. Information retrieval
82
- 4. Causal analysis
83
- 5. Text summarization
84
- 6. Question answering
85
- - Domains categorized by data source, origination, time period, etc.
86
- - Currently focuses on English language
87
-
88
- #### Datasets
89
- Selected based on:
90
- - Financial domain relevance
91
- - Fair usage licensing
92
- - Annotation quality
93
- - Task substance
94
-
95
- Key datasets include:
96
- - Banking: Banking77, FiQA, FinRED
97
- - Investment: FPB, Headlines, SubjectiveQA
98
- - Accounting: FinQA, TaT-QA, ConvFinQA
99
- - Corporate: ECTSum, EDTSum, FinCausal
100
- - Monetary Policy: FOMC, FNXL
101
- - Cross-domain: FinBench, NumClaim, ReFINED
102
-
103
- #### Models Evaluated
104
- - Proprietary closed-source: GPT-4o & o1-mini, Gemini-1.5, Claude3, Cohere Command R
105
- - Open-weight: Llama-3, DeepSeekV3 & R-1, Qwen-2 & QwQ, Mistral, Gemma-1 & 2, Mixtral, WizardLM2, DBRX
106
- - Used deterministic decoding (temperature 0.0, top p of 0.9, repetition penalty of 1)
107
-
108
- #### Evaluation Process
109
- - Two-stage approach: generation and extraction
110
- - Task-specific metrics: accuracy, F1 scores, precision, recall, BLEU scores
111
- - Standardized zero-shot evaluation
112
-
113
- ### Key Findings
114
- - No single model performs best across all tasks
115
- - Performance varies significantly based on domain and task structure
116
- - Open-weight models show strong cost/performance efficiency
117
- - Numeric reasoning tasks remain challenging for all models
118
- - Inconsistent scaling: larger parameter sizes don't guarantee higher performance
119
- - Models struggle with consistent numeric formats and longer label sets
120
- - Top performers: DeepSeek R1, OpenAI o1-mini, Claude 3.5 Sonnet
121
-
122
- ### Limitations
123
- - Limited dataset size and diversity
124
- - Focus on zero-shot scenarios only
125
- - English-language focus
126
- - No evaluation of advanced prompting techniques
127
- - Doesn't capture full breadth of real-world financial scenarios
128
-
129
- ### Future Directions
130
- - More advanced prompt engineering
131
- - Domain-adaptive training for numeric/causal tasks
132
- - Benchmarking efficiency trade-offs
133
- - Multi-lingual coverage expansion
134
-
135
- ### Resources
136
- - Paper PDF: FLaME/FLaME__ACL_AAR_Feb_2025_.pdf
137
- - ArXiv: https://arxiv.org/abs/2402.14017
138
  - GitHub: https://github.com/flame-benchmark/flame
139
  - HuggingFace: https://huggingface.co/spaces/flame-benchmark/flame
 
2
 
3
  ## Project Overview
4
  - FLaME: Holistic Financial Language Model Evaluation
5
+ - LaTeX paper with static website built using Bulma CSS
6
+ - Research project for ACL Annual Advances in Research (Feb 2025)
 
7
  - Hosted on HuggingFace Spaces
8
 
9
+ ## Commands
10
+ - Build LaTeX paper: `pdflatex FLaME.tex && bibtex FLaME && pdflatex FLaME.tex && pdflatex FLaME.tex`
11
+ - Local website testing: `python -m http.server 8000`
12
+ - Fix tooltips: `./fix_tooltips.sh`
13
 
14
  ## Code Style Guidelines
15
+ - HTML: Semantic HTML5, Bulma framework conventions
16
+ - CSS: Follow Bulma conventions, keep in static/css/
17
  - JavaScript:
18
+ - Use camelCase for variables/functions
19
+ - 2-space indentation
20
+ - Include semicolons
21
+ - Prefer vanilla JS
22
+ - Store in static/js/
23
+ - LaTeX: Follow ACL style guidelines in acl_formatting.md
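An illustrative snippet (not part of CLAUDE.md) showing the JavaScript conventions listed above: camelCase names, 2-space indentation, semicolons, plain function declarations, and vanilla JS with no dependencies. The helper name and the data-score attribute are hypothetical:

```javascript
// Illustrative only: a small helper written in the style described above.
// formatScore and the data-score attribute are hypothetical examples.
function formatScore(value, digits) {
  if (typeof value !== 'number' || isNaN(value)) {
    return 'n/a';
  }
  return value.toFixed(digits || 2);
}

document.querySelectorAll('[data-score]').forEach(function (cell) {
  cell.textContent = formatScore(parseFloat(cell.dataset.score), 3);
});
```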
24
+
25
+ ## Paper Structure
26
+ - Main source in FLaME.tex
27
+ - Content modularized in FLaME/content/ directory
28
+ - Six core task areas: text classification, sentiment analysis, info retrieval, causal analysis, summarization, QA
29
+ - Datasets in FLaME/content/datasets/
30
+ - Results in FLaME/content/tables/
31
+
32
+ ## Website Design
33
+ - Color scheme: Deep blue (#004d99), Orange (#ff6b00), Light bg (#f8f9fa)
34
+ - Card-based responsive layout
35
+ - Interactive elements with tooltips
36
+ - Results displayed in task-specific tables
37
+ - Media files optimized for web delivery
38
+ - Convert PDF figures to JPG/PNG for web display
39
+
40
+ ## Deployment
41
+ - Website deploys to HuggingFace Spaces
42
+ - Paper available as PDF at FLaME/FLaME.pdf
 
43
  - GitHub: https://github.com/flame-benchmark/flame
44
  - HuggingFace: https://huggingface.co/spaces/flame-benchmark/flame
FLaME/{FLaME__ACL_AAR_Feb_2025_.pdf → FLaME.pdf} RENAMED
Binary files a/FLaME/FLaME__ACL_AAR_Feb_2025_.pdf and b/FLaME/FLaME.pdf differ
 
FLaME/FLaME.pdf:Zone.Identifier ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ [ZoneTransfer]
2
+ ZoneId=3
3
+ ReferrerUrl=https://www.overleaf.com/project/67380d90965fe3bdf7157e6c
4
+ HostUrl=https://www.overleaf.com/download/project/67380d90965fe3bdf7157e6c/build/195acbb99c5-1bdebd47a7495259/output/output.pdf?compileGroup=priority&clsiserverid=clsi-pre-emp-c2d-c-f-bzgb&enable_pdf_caching=true&popupDownload=true
FLaME/FLaME[ACLAARFeb2025]/content/0_authors.tex CHANGED
@@ -1 +1 @@
1
- \author{Oopy Goopy, {\bf General Munchkin Man}, {\bf L'il Jim Bob}, {\bf Larry} \\ Georgia Institute of Technology}
 
1
+ \author{Glenn Matlin, {\bf Mika Okamoto}, {\bf Huzaifa Pardwala}, {\bf Yang Yang}, {\bf Sudheer Chava} \\ Georgia Institute of Technology}
FLaME/content/0_authors.tex CHANGED
@@ -1 +1 @@
1
- \author{Oopy Goopy, {\bf General Munchkin Man}, {\bf L'il Jim Bob}, {\bf Larry} \\ Georgia Institute of Technology}
 
1
+ \author{Glenn Matlin, {\bf Mika Okamoto}, {\bf Huzaifa Pardwala}, {\bf Yang Yang}, {\bf Sudheer Chava} \\ Georgia Institute of Technology}
index.html CHANGED
@@ -93,14 +93,15 @@
93
  <h1 class="title is-1 publication-title">FLaME: Holistic Financial Language Model Evaluation</h1>
94
  <div class="is-size-5 publication-authors">
95
  <span class="author-block">
96
- <a href="#" target="_blank">Oopy Goopy</a><sup>1</sup>,</span>
97
  <span class="author-block">
98
- <a href="#" target="_blank">General Munchkin Man</a><sup>1</sup>,</span>
99
  <span class="author-block">
100
- <a href="#" target="_blank">L'il Jim Bob</a><sup>1</sup>,
101
- </span>
 
102
  <span class="author-block">
103
- <a href="#" target="_blank">Larry</a><sup>1</sup>
104
  </span>
105
  </div>
106
 
@@ -112,7 +113,7 @@
112
  <div class="publication-links">
113
  <!-- PDF Link. -->
114
  <span class="link-block">
115
- <a href="FLaME/FLaME__ACL_AAR_Feb_2025_.pdf" target="_blank"
116
  class="external-link button is-normal is-rounded is-dark">
117
  <span class="icon">
118
  <i class="fas fa-file-pdf"></i>
@@ -3911,15 +3912,15 @@
3911
 
3912
  <hr>
3913
 
3914
- <p class="has-text-weight-bold mb-3">🔍 Key Insights from Model Analysis</p>
3915
 
3916
  <div class="notification is-info is-light py-3 px-4">
3917
- <p><strong>🏆 No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
3918
- <p><strong>⚖️ Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
3919
- <p><strong>🛠️ Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
3920
- <p><strong>💰 Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
3921
- <p><strong>📉 Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
3922
- <p><strong>🔢 Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
3923
  </div>
3924
  </div>
3925
  </div>
@@ -4200,27 +4201,27 @@
4200
  </div>
4201
 
4202
  <div class="notification is-info is-light py-2 px-3 mb-3">
4203
- <p class="has-text-weight-bold mb-1">🧠 Few-Shot & Chain-of-Thought</p>
4204
  <p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
4205
  </div>
4206
 
4207
  <div class="notification is-info is-light py-2 px-3 mb-3">
4208
- <p class="has-text-weight-bold mb-1">📊 Domain-Adaptive Training</p>
4209
  <p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.</p>
4210
  </div>
4211
 
4212
  <div class="notification is-info is-light py-2 px-3 mb-3">
4213
- <p class="has-text-weight-bold mb-1">🔍 Expanded Dataset Coverage</p>
4214
  <p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
4215
  </div>
4216
 
4217
  <div class="notification is-info is-light py-2 px-3 mb-3">
4218
- <p class="has-text-weight-bold mb-1">⚖️ Efficiency & Cost Benchmarking</p>
4219
  <p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
4220
  </div>
4221
 
4222
  <div class="notification is-info is-light py-2 px-3 mb-3">
4223
- <p class="has-text-weight-bold mb-1">📈 Advanced Evaluation Metrics</p>
4224
  <p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
4225
  </div>
4226
 
@@ -4299,7 +4300,7 @@
4299
 
4300
  <div class="feature-item mb-3">
4301
  <p class="has-text-weight-bold mb-1">
4302
- <span class="icon has-text-primary"><i class="fas fa-check"></i></span> 📊 Reproducible Benchmarking
4303
  </p>
4304
  <p class="is-size-7 ml-4">Ensures consistent evaluation metrics and transparent methodology.</p>
4305
  </div>
@@ -4391,7 +4392,7 @@
4391
  <div class="column is-6">
4392
  <div class="dataset-category box">
4393
  <p class="has-text-weight-bold">
4394
- <span class="icon has-text-primary"><i class="fa-solid fa-calculator"></i></span> 📊 Numerical Reasoning & Question Answering
4395
  </p>
4396
  <ul>
4397
  <li><strong>FinQA</strong> – Multi-step financial numerical reasoning.</li>
@@ -4405,7 +4406,7 @@
4405
  <div class="column is-6">
4406
  <div class="dataset-category box">
4407
  <p class="has-text-weight-bold">
4408
- <span class="icon has-text-primary"><i class="fa-solid fa-file-lines"></i></span> 📝 Text Summarization
4409
  </p>
4410
  <ul>
4411
  <li><strong>ECTSum</strong> – Earnings call transcript summarization.</li>
@@ -4418,7 +4419,7 @@
4418
  <div class="column is-6">
4419
  <div class="dataset-category box">
4420
  <p class="has-text-weight-bold">
4421
- <span class="icon has-text-primary"><i class="fa-solid fa-search"></i></span> 🔎 Information Retrieval
4422
  </p>
4423
  <ul>
4424
  <li><strong>FiNER-ORD</strong> – Named entity recognition for financial documents.</li>
@@ -4434,7 +4435,7 @@
4434
  <div class="column is-6">
4435
  <div class="dataset-category box">
4436
  <p class="has-text-weight-bold">
4437
- <span class="icon has-text-primary"><i class="fa-solid fa-comment-alt"></i></span> 😐 Sentiment Analysis
4438
  </p>
4439
  <ul>
4440
  <li><strong>FiQA (Task 1)</strong> – Aspect-based sentiment analysis.</li>
@@ -4449,7 +4450,7 @@
4449
  <div class="column is-6">
4450
  <div class="dataset-category box">
4451
  <p class="has-text-weight-bold">
4452
- <span class="icon has-text-primary"><i class="fa-solid fa-tags"></i></span> 🏷️ Text Classification
4453
  </p>
4454
  <ul>
4455
  <li><strong>Numerical Claim Detection</strong> – Fine-grained investor claim detection.</li>
@@ -4461,11 +4462,11 @@
4461
  </div>
4462
  </div>
4463
 
4464
- <!-- Causal Analysis -->
4465
  <div class="column is-6">
4466
  <div class="dataset-category box">
4467
  <p class="has-text-weight-bold">
4468
- <span class="icon"><i class="fa-solid fa-brain"></i></span> 🧠 Causal Analysis
4469
  </p>
4470
  <ul>
4471
  <li><strong>FinCausal</strong> – Causal reasoning in financial news.</li>
@@ -4493,7 +4494,6 @@
4493
  <pre><code>@article{flame2025,
4494
  author = {Goopy, Oopy and Man, General Munchkin and Bob, L'il Jim and Larry},
4495
  title = {FLaME: Holistic Financial Language Model Evaluation},
4496
- journal = {ACL Annual Advances in Research},
4497
  year = {2025},
4498
  month = {February},
4499
  }</code></pre>
@@ -4510,7 +4510,7 @@
4510
  <h4 class="has-text-white mb-4"><span class="flame">FLaME</span>: Financial Language Model Evaluation</h4>
4511
 
4512
  <div class="footer-links mb-5">
4513
- <a class="icon-link mr-3" target="_blank" href="FLaME/FLaME__ACL_AAR_Feb_2025_.pdf" title="Download PDF">
4514
  <i class="fas fa-file-pdf fa-lg"></i>
4515
  </a>
4516
  <a class="icon-link mr-3" href="https://arxiv.org/abs/2402.14017" target="_blank" title="View on arXiv">
@@ -4525,7 +4525,7 @@
4525
  </div>
4526
 
4527
  <div class="institution-info mb-4">
4528
- <p class="has-text-white-ter">Georgia Institute of Technology | ACL 2025</p>
4529
  </div>
4530
 
4531
  <p class="has-text-white-ter is-size-7">
 
93
  <h1 class="title is-1 publication-title">FLaME: Holistic Financial Language Model Evaluation</h1>
94
  <div class="is-size-5 publication-authors">
95
  <span class="author-block">
96
+ <a href="#" target="_blank">Glenn Matlin</a><sup>1</sup>,</span>
97
  <span class="author-block">
98
+ <a href="#" target="_blank">Mika Okamoto</a><sup>1</sup>,</span>
99
  <span class="author-block">
100
+ <a href="#" target="_blank">Huzaifa Pardwala</a><sup>1</sup>,</span>
101
+ <span class="author-block">
102
+ <a href="#" target="_blank">Yang Yang</a><sup>1</sup>,</span>
103
  <span class="author-block">
104
+ <a href="#" target="_blank">Sudheer Chava</a><sup>1</sup>
105
  </span>
106
  </div>
107
 
 
113
  <div class="publication-links">
114
  <!-- PDF Link. -->
115
  <span class="link-block">
116
+ <a href="FLaME/FLaME.pdf" target="_blank"
117
  class="external-link button is-normal is-rounded is-dark">
118
  <span class="icon">
119
  <i class="fas fa-file-pdf"></i>
 
3912
 
3913
  <hr>
3914
 
3915
+ <p class="has-text-weight-bold mb-3"><span class="icon has-text-primary"><i class="fa-solid fa-magnifying-glass"></i></span> Key Insights from Model Analysis</p>
3916
 
3917
  <div class="notification is-info is-light py-3 px-4">
3918
+ <p><strong><span class="icon"><i class="fa-solid fa-trophy"></i></span> No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
3919
+ <p><strong><span class="icon"><i class="fa-solid fa-balance-scale"></i></span> Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
3920
+ <p><strong><span class="icon"><i class="fa-solid fa-tools"></i></span> Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
3921
+ <p><strong><span class="icon"><i class="fa-solid fa-coins"></i></span> Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
3922
+ <p><strong><span class="icon"><i class="fa-solid fa-chart-line"></i></span> Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
3923
+ <p><strong><span class="icon"><i class="fa-solid fa-list-ol"></i></span> Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
3924
  </div>
3925
  </div>
3926
  </div>
 
4201
  </div>
4202
 
4203
  <div class="notification is-info is-light py-2 px-3 mb-3">
4204
+ <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-brain"></i></span> Few-Shot & Chain-of-Thought</p>
4205
  <p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
4206
  </div>
4207
 
4208
  <div class="notification is-info is-light py-2 px-3 mb-3">
4209
+ <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-line"></i></span> Domain-Adaptive Training</p>
4210
  <p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.</p>
4211
  </div>
4212
 
4213
  <div class="notification is-info is-light py-2 px-3 mb-3">
4214
+ <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-database"></i></span> Expanded Dataset Coverage</p>
4215
  <p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
4216
  </div>
4217
 
4218
  <div class="notification is-info is-light py-2 px-3 mb-3">
4219
+ <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-balance-scale"></i></span> Efficiency & Cost Benchmarking</p>
4220
  <p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
4221
  </div>
4222
 
4223
  <div class="notification is-info is-light py-2 px-3 mb-3">
4224
+ <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-bar"></i></span> Advanced Evaluation Metrics</p>
4225
  <p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
4226
  </div>
4227
 
 
4300
 
4301
  <div class="feature-item mb-3">
4302
  <p class="has-text-weight-bold mb-1">
4303
+ <span class="icon has-text-primary"><i class="fas fa-check"></i></span> Reproducible Benchmarking
4304
  </p>
4305
  <p class="is-size-7 ml-4">Ensures consistent evaluation metrics and transparent methodology.</p>
4306
  </div>
 
4392
  <div class="column is-6">
4393
  <div class="dataset-category box">
4394
  <p class="has-text-weight-bold">
4395
+ 📊 Numerical Reasoning & Question Answering
4396
  </p>
4397
  <ul>
4398
  <li><strong>FinQA</strong> – Multi-step financial numerical reasoning.</li>
 
4406
  <div class="column is-6">
4407
  <div class="dataset-category box">
4408
  <p class="has-text-weight-bold">
4409
+ 📝 Text Summarization
4410
  </p>
4411
  <ul>
4412
  <li><strong>ECTSum</strong> – Earnings call transcript summarization.</li>
 
4419
  <div class="column is-6">
4420
  <div class="dataset-category box">
4421
  <p class="has-text-weight-bold">
4422
+ 🔎 Information Retrieval
4423
  </p>
4424
  <ul>
4425
  <li><strong>FiNER-ORD</strong> – Named entity recognition for financial documents.</li>
 
4435
  <div class="column is-6">
4436
  <div class="dataset-category box">
4437
  <p class="has-text-weight-bold">
4438
+ 😐 Sentiment Analysis
4439
  </p>
4440
  <ul>
4441
  <li><strong>FiQA (Task 1)</strong> – Aspect-based sentiment analysis.</li>
 
4450
  <div class="column is-6">
4451
  <div class="dataset-category box">
4452
  <p class="has-text-weight-bold">
4453
+ 🏷️ Text Classification
4454
  </p>
4455
  <ul>
4456
  <li><strong>Numerical Claim Detection</strong> – Fine-grained investor claim detection.</li>
 
4462
  </div>
4463
  </div>
4464
 
4465
+ <!-- 🧠 Causal Analysis -->
4466
  <div class="column is-6">
4467
  <div class="dataset-category box">
4468
  <p class="has-text-weight-bold">
4469
+ 🧠 Causal Analysis
4470
  </p>
4471
  <ul>
4472
  <li><strong>FinCausal</strong> – Causal reasoning in financial news.</li>
 
4494
  <pre><code>@article{flame2025,
4495
  author = {Goopy, Oopy and Man, General Munchkin and Bob, L'il Jim and Larry},
4496
  title = {FLaME: Holistic Financial Language Model Evaluation},
 
4497
  year = {2025},
4498
  month = {February},
4499
  }</code></pre>
 
4510
  <h4 class="has-text-white mb-4"><span class="flame">FLaME</span>: Financial Language Model Evaluation</h4>
4511
 
4512
  <div class="footer-links mb-5">
4513
+ <a class="icon-link mr-3" target="_blank" href="FLaME/FLaME.pdf" title="Download PDF">
4514
  <i class="fas fa-file-pdf fa-lg"></i>
4515
  </a>
4516
  <a class="icon-link mr-3" href="https://arxiv.org/abs/2402.14017" target="_blank" title="View on arXiv">
 
4525
  </div>
4526
 
4527
  <div class="institution-info mb-4">
4528
+ <p class="has-text-white-ter">Georgia Institute of Technology</p>
4529
  </div>
4530
 
4531
  <p class="has-text-white-ter is-size-7">