<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>FLaME Results</title>
<link rel="stylesheet" href="static/css/bulma.min.css">
<link rel="stylesheet" href="static/css/index.css">
<link rel="stylesheet" href="static/css/results.css">
</head>
<body>
<nav class="navbar is-fixed-top is-primary" role="navigation" aria-label="main navigation">
<div class="container">
<div class="navbar-brand">
<a class="navbar-item" href="index.html">
<strong>FLaME</strong>
</a>
</div>
<div class="navbar-menu">
<div class="navbar-end">
<a class="navbar-item" href="index.html">
Home
</a>
<a class="navbar-item is-active" href="results.html">
Results
</a>
</div>
</div>
</div>
</nav>
<section class="section">
<div class="container">
<h1 class="title is-2">FLaME: Financial Language Model Evaluation Results</h1>
<div class="content">
<p class="is-size-5">
This page presents the results of the FLaME evaluation across financial NLP tasks.
Each tab shows performance metrics for a different task category.
</p>
</div>
<div class="tabs-container">
<div class="tabs is-centered is-boxed">
<ul>
<li class="is-active" data-tab="main">
<a>
<span>All Tasks</span>
</a>
</li>
<li data-tab="causal-analysis">
<a>
<span>Causal Analysis</span>
</a>
</li>
<li data-tab="information-retrieval">
<a>
<span>Information Retrieval</span>
</a>
</li>
<li data-tab="question-answering">
<a>
<span>Question Answering</span>
</a>
</li>
<li data-tab="sentiment-analysis">
<a>
<span>Sentiment Analysis</span>
</a>
</li>
<li data-tab="text-classification">
<a>
<span>Text Classification</span>
</a>
</li>
<li data-tab="text-summarization">
<a>
<span>Text Summarization</span>
</a>
</li>
</ul>
</div>
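<!-- Illustrative sketch, not necessarily the page's actual handler: the tab markup above
     relies on a script to toggle the .tab-content panes via the data-tab attributes. If no
     such handler is attached elsewhere in this page's assets (and results.css does not
     already control pane visibility), a minimal version would look like this. -->
<script>
document.addEventListener('DOMContentLoaded', function () {
  var tabs = document.querySelectorAll('.tabs li[data-tab]');
  tabs.forEach(function (tab) {
    tab.addEventListener('click', function () {
      // Highlight the clicked tab and clear the others.
      tabs.forEach(function (t) { t.classList.remove('is-active'); });
      tab.classList.add('is-active');
      // Show only the pane whose id matches the tab's data-tab value.
      document.querySelectorAll('.tab-content').forEach(function (pane) {
        pane.style.display = (pane.id === tab.dataset.tab) ? '' : 'none';
      });
    });
  });
});
</script>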
<div id="main" class="tab-content">
<h2 class="title is-4">Overall Performance Across All Tasks</h2>
<div class="results-table">
<table class="table is-striped is-hoverable is-fullwidth is-size-7">
<thead>
<tr class="has-background-grey-lighter">
<th>Model</th>
<th colspan="5" class="has-text-centered column-border-left">Information Retrieval</th>
<th colspan="3" class="has-text-centered column-border-left">Sentiment Analysis</th>
<th colspan="2" class="has-text-centered column-border-left">Causal Analysis</th>
<th colspan="5" class="has-text-centered column-border-left">Text Classification</th>
<th colspan="3" class="has-text-centered column-border-left">Question Answering</th>
<th colspan="2" class="has-text-centered column-border-left">Summarization</th>
</tr>
<tr>
<th>Dataset</th>
<th class="has-text-centered tooltip-trigger column-border-left" data-tooltip="FiNER-Open Research Dataset: A manually annotated dataset for financial named entity recognition, containing 47,851 financial news articles with annotations for person, location, and organization entities.">FiNER</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="FinRED: A specialized relation extraction dataset for the financial domain, created from financial news and earnings call transcripts, with financial relations mapped using distance supervision based on Wikidata triplets.">FR</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="REFinD: A specialized relation extraction dataset created to address the unique challenges of extracting relationships between entity pairs from financial texts.">RD</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Financial Numeric Extreme Labeling (FNXL): A dataset addressing the challenge of automating the annotation of numerals in financial statements with appropriate labels from a vast taxonomy.">FNXL</th>
<th class="has-text-centered tooltip-trigger column-border-left" data-tooltip="FinEntity: A dataset for financial entity recognition and classification.">FE</th>
<th class="has-text-centered tooltip-trigger column-border-left" data-tooltip="FiQA Task 1: A dataset for aspect-based financial sentiment analysis. Predicts sentiment scores on a continuous scale from -1 (negative) to 1 (positive) for financial texts such as microblog posts or news headlines.">FiQA</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="SubjECTive-QA: A manually-annotated dataset focusing on subjectivity in Earnings Call Transcripts, including 49,446 annotations across 2,747 QA pairs labeled on six subjectivity features.">SQA</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Financial Phrase Bank: A dataset for sentiment analysis containing 4,840 sentences from English-language financial news articles, categorized as positive, negative, or neutral sentiment.">FPB</th>
<th class="has-text-centered tooltip-trigger column-border-left" data-tooltip="FinCausal Causality Detection: For text sections identified as causal, this task extracts the Cause and Effect spans, handling both unicausal and multicausal cases in financial texts.">CD</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="FinCausal Causality Classification: Determines if a given financial text section contains a causal relation, labeled as 1 if causal and 0 otherwise.">CC</th>
<th class="has-text-centered tooltip-trigger column-border-left" data-tooltip="Banking77: A fine-grained dataset for intent detection within the banking domain, comprising 13,083 customer service queries annotated with 77 unique intents.">B77</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="FinBench: A dataset designed to evaluate machine learning models using tabular data and profile text inputs for financial risk prediction, covering default, fraud, and churn risks.">FB</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Federal Open Market Committee: A dataset of FOMC speeches, meeting minutes, and press conference transcripts (1996-2022) for analyzing monetary policy announcements, with hawkish-dovish classification.">FOMC</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Numerical Claim Detection Dataset: An expert-annotated dataset for detecting fine-grained investor claims within financial narratives, with focus on numerals from analyst reports and earnings call transcripts.">NC</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Headlines: A dataset of 11,412 human-annotated financial news headlines focused on commodities (gold), including indicators for price mentions, price movement direction, and references to past/future prices.">HL</th>
<th class="has-text-centered tooltip-trigger column-border-left" data-tooltip="ConvFinQA: A multi-turn question answering dataset with 3,892 conversations and 14,115 questions exploring chains of numerical reasoning in conversational QA within the financial domain.">CFQA</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="FinQA: A large-scale dataset for numerical reasoning over financial data, consisting of 8,281 question-answer pairs from financial reports that require multi-step reasoning.">FinQA</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="TAT-QA: A large-scale question-answering dataset for hybrid data sources combining tabular and textual content from financial reports, requiring numerical reasoning operations.">TQA</th>
<th class="has-text-centered tooltip-trigger column-border-left" data-tooltip="ECTSum: A dataset for bullet-point summarization of long earnings call transcripts, with 2,425 document-summary pairs from publicly traded companies' earnings calls between 2019-2022.">ECTSum</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="EDTSum: A financial news summarization dataset with 2,000 financial news articles, each paired with its headline as the ground-truth summary, for evaluating LLMs in generating concise summaries.">EDTSum</th>
</tr>
<tr class="has-background-light row-border-bottom">
<th>Metric Used</th>
<th colspan="5" class="has-text-centered column-border-left">F1 Score</th>
<th class="has-text-centered column-border-left">MSE</th>
<th colspan="8" class="has-text-centered column-border-left">F1 Score</th>
<th colspan="4" class="has-text-centered column-border-left">Accuracy</th>
<th colspan="2" class="has-text-centered column-border-left">BERTScore F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 3 70B Instruct</td>
<td class="column-border-left">.701</td><td>.332</td><td>.883</td><td>.020</td>
<td class="column-border-left">.469</td>
<td class="column-border-left">.123</td><td>.535</td><td>.902</td>
<td class="column-border-left">.142</td><td>.192</td>
<td class="column-border-left">.645</td><td>.309</td><td>.652</td><td>.386</td><td>.811</td>
<td class="column-border-left">.709</td><td>.809</td><td>.772</td>
<td class="column-border-left">.754</td><td class="performance-strong">.817</td>
</tr>
<tr>
<td>Llama 3 8B Instruct</td>
<td class="column-border-left">.565</td><td>.289</td><td>.705</td><td>.003</td>
<td class="column-border-left">.350</td>
<td class="column-border-left">.161</td><td class="performance-best">.600</td><td>.698</td>
<td class="column-border-left">.049</td><td>.234</td>
<td class="column-border-left">.512</td><td>.659</td><td>.497</td><td>.511</td><td>.763</td>
<td class="column-border-left">.268</td><td>.767</td><td>.706</td>
<td class="column-border-left">.757</td><td>.811</td>
</tr>
<tr>
<td>DBRX Instruct</td>
<td class="column-border-left">.489</td><td>.304</td><td>.778</td><td>.009</td>
<td class="column-border-left">.006</td>
<td class="column-border-left">.160</td><td>.436</td><td>.499</td>
<td class="column-border-left">.087</td><td>.231</td>
<td class="column-border-left">.574</td><td>.483</td><td>.193</td><td>.319</td><td>.746</td>
<td class="column-border-left">.252</td><td>.738</td><td>.633</td>
<td class="column-border-left">.729</td><td>.806</td>
</tr>
<tr>
<td>DeepSeek LLM (67B)</td>
<td class="column-border-left">.745</td><td>.334</td><td>.879</td><td>.007</td>
<td class="column-border-left">.416</td>
<td class="column-border-left">.118</td><td>.462</td><td>.811</td>
<td class="column-border-left">.025</td><td>.193</td>
<td class="column-border-left">.578</td><td>.492</td><td>.407</td><td>.151</td><td>.778</td>
<td class="column-border-left">.174</td><td>.742</td><td>.355</td>
<td class="column-border-left">.681</td><td>.807</td>
</tr>
<tr>
<td>Gemma 2 27B</td>
<td class="column-border-left">.761</td><td>.356</td><td>.902</td><td>.006</td>
<td class="column-border-left">.298</td>
<td class="column-border-left performance-best">.100</td><td>.515</td><td>.884</td>
<td class="column-border-left">.133</td><td>.242</td>
<td class="column-border-left">.621</td><td>.538</td><td>.620</td><td>.408</td><td>.808</td>
<td class="column-border-left">.268</td><td>.768</td><td>.734</td>
<td class="column-border-left">.723</td><td>.814</td>
</tr>
<tr>
<td>Gemma 2 9B</td>
<td class="column-border-left">.651</td><td>.331</td><td>.892</td><td>.005</td>
<td class="column-border-left">.367</td>
<td class="column-border-left">.189</td><td>.491</td><td class="performance-strong">.940</td>
<td class="column-border-left">.105</td><td>.207</td>
<td class="column-border-left">.609</td><td>.541</td><td>.519</td><td>.365</td><td class="performance-best">.856</td>
<td class="column-border-left">.292</td><td>.779</td><td>.750</td>
<td class="column-border-left">.585</td><td class="performance-strong">.817</td>
</tr>
<tr>
<td>Mistral (7B) Instruct v0.3</td>
<td class="column-border-left">.526</td><td>.276</td><td>.771</td><td>.004</td>
<td class="column-border-left">.368</td>
<td class="column-border-left">.135</td><td>.522</td><td>.841</td>
<td class="column-border-left">.052</td><td>.227</td>
<td class="column-border-left">.528</td><td>.503</td><td>.542</td><td>.412</td><td>.779</td>
<td class="column-border-left">.199</td><td>.655</td><td>.553</td>
<td class="column-border-left">.750</td><td>.811</td>
</tr>
<tr>
<td>Mixtral-8x22B Instruct</td>
<td class="column-border-left">.635</td><td>.367</td><td>.811</td><td>.009</td>
<td class="column-border-left">.435</td>
<td class="column-border-left">.221</td><td>.510</td><td>.776</td>
<td class="column-border-left">.125</td><td class="performance-best">.308</td>
<td class="column-border-left">.602</td><td>.221</td><td>.465</td><td>.513</td><td class="performance-strong">.835</td>
<td class="column-border-left">.285</td><td>.766</td><td>.666</td>
<td class="column-border-left">.758</td><td>.815</td>
</tr>
<tr>
<td>Mixtral-8x7B Instruct</td>
<td class="column-border-left">.598</td><td>.282</td><td>.845</td><td>.009</td>
<td class="column-border-left">.267</td>
<td class="column-border-left">.208</td><td>.498</td><td>.893</td>
<td class="column-border-left">.055</td><td>.229</td>
<td class="column-border-left">.547</td><td>.396</td><td>.603</td><td>.583</td><td>.805</td>
<td class="column-border-left">.315</td><td>.611</td><td>.501</td>
<td class="column-border-left">.747</td><td>.810</td>
</tr>
<tr>
<td>Qwen 2 Instruct (72B)</td>
<td class="column-border-left">.748</td><td>.348</td><td>.854</td><td>.012</td>
<td class="column-border-left">.483</td>
<td class="column-border-left">.205</td><td>.576</td><td>.901</td>
<td class="column-border-left">.190</td><td>.184</td>
<td class="column-border-left">.627</td><td>.495</td><td>.605</td><td>.639</td><td>.830</td>
<td class="column-border-left">.269</td><td>.819</td><td>.715</td>
<td class="column-border-left">.752</td><td>.811</td>
</tr>
<tr>
<td>WizardLM-2 8x22B</td>
<td class="column-border-left">.744</td><td>.355</td><td>.852</td><td>.008</td>
<td class="column-border-left">.226</td>
<td class="column-border-left">.129</td><td>.566</td><td>.779</td>
<td class="column-border-left">.114</td><td>.201</td>
<td class="column-border-left">.648</td><td>.500</td><td>.505</td><td>.272</td><td>.797</td>
<td class="column-border-left">.247</td><td>.796</td><td>.725</td>
<td class="column-border-left">.735</td><td>.808</td>
</tr>
<tr>
<td>DeepSeek-V3</td>
<td class="column-border-left performance-strong">.790</td><td class="performance-strong">.437</td><td>.934</td><td class="performance-strong">.045</td>
<td class="column-border-left">.549</td>
<td class="column-border-left">.150</td><td class="performance-strong">.583</td><td>.814</td>
<td class="column-border-left performance-strong">.198</td><td>.170</td>
<td class="column-border-left performance-strong">.714</td><td>.487</td><td>.578</td><td>.675</td><td>.729</td>
<td class="column-border-left">.261</td><td class="performance-strong">.840</td><td class="performance-strong">.779</td>
<td class="column-border-left">.750</td><td>.815</td>
</tr>
<tr>
<td>DeepSeek R1</td>
<td class="column-border-left performance-best">.807</td><td>.393</td><td class="performance-best">.952</td><td class="performance-best">.057</td>
<td class="column-border-left performance-strong">.587</td>
<td class="column-border-left">.110</td><td>.499</td><td>.902</td>
<td class="column-border-left performance-best">.337</td><td>.202</td>
<td class="column-border-left performance-best">.763</td><td>.419</td><td class="performance-strong">.670</td><td>.688</td><td>.769</td>
<td class="column-border-left performance-best">.853</td><td class="performance-strong">.836</td><td class="performance-best">.858</td>
<td class="column-border-left">.759</td><td>.804</td>
</tr>
<tr>
<td>QwQ-32B-Preview</td>
<td class="column-border-left">.685</td><td>.270</td><td>.656</td><td>.001</td>
<td class="column-border-left">.005</td>
<td class="column-border-left">.141</td><td>.550</td><td>.815</td>
<td class="column-border-left">.131</td><td>.220</td>
<td class="column-border-left">.613</td><td class="performance-strong">.784</td><td>.555</td><td>.020</td><td>.744</td>
<td class="column-border-left">.282</td><td>.793</td><td class="performance-strong">.796</td>
<td class="column-border-left">.696</td><td class="performance-strong">.817</td>
</tr>
<tr>
<td>Jamba 1.5 Mini</td>
<td class="column-border-left">.552</td><td>.284</td><td>.844</td><td>.005</td>
<td class="column-border-left">.132</td>
<td class="column-border-left">.119</td><td>.418</td><td>.765</td>
<td class="column-border-left">.043</td><td class="performance-strong">.270</td>
<td class="column-border-left">.508</td><td class="performance-best">.898</td><td>.499</td><td>.151</td><td>.682</td>
<td class="column-border-left">.218</td><td>.666</td><td>.586</td>
<td class="column-border-left">.741</td><td>.816</td>
</tr>
<tr>
<td>Jamba 1.5 Large</td>
<td class="column-border-left">.693</td><td>.341</td><td>.862</td><td>.005</td>
<td class="column-border-left">.397</td>
<td class="column-border-left">.183</td><td>.582</td><td>.798</td>
<td class="column-border-left">.074</td><td>.176</td>
<td class="column-border-left">.628</td><td>.618</td><td>.550</td><td>.541</td><td>.782</td>
<td class="column-border-left">.225</td><td>.790</td><td>.660</td>
<td class="column-border-left">.734</td><td class="performance-best">.818</td>
</tr>
<tr>
<td>Claude 3.5 Sonnet</td>
<td class="column-border-left performance-strong">.799</td><td class="performance-best">.439</td><td>.891</td><td class="performance-strong">.047</td>
<td class="column-border-left performance-strong">.655</td>
<td class="column-border-left performance-strong">.101</td><td>.553</td><td class="performance-best">.944</td>
<td class="column-border-left">.196</td><td>.197</td>
<td class="column-border-left">.668</td><td>.634</td><td class="performance-best">.674</td><td class="performance-strong">.692</td><td>.827</td>
<td class="column-border-left">.402</td><td class="performance-best">.844</td><td>.700</td>
<td class="column-border-left performance-strong">.767</td><td>.813</td>
</tr>
<tr>
<td>Claude 3 Haiku</td>
<td class="column-border-left">.711</td><td>.285</td><td>.883</td><td>.015</td>
<td class="column-border-left">.494</td>
<td class="column-border-left">.167</td><td>.463</td><td>.908</td>
<td class="column-border-left">.081</td><td>.200</td>
<td class="column-border-left">.622</td><td>.022</td><td>.631</td><td>.558</td><td>.781</td>
<td class="column-border-left">.421</td><td>.803</td><td>.733</td>
<td class="column-border-left">.646</td><td>.808</td>
</tr>
<tr>
<td>Cohere Command R 7B</td>
<td class="column-border-left">.748</td><td>.194</td><td>.845</td><td>.018</td>
<td class="column-border-left">.441</td>
<td class="column-border-left">.164</td><td>.532</td><td>.840</td>
<td class="column-border-left">.057</td><td class="performance-strong">.255</td>
<td class="column-border-left">.516</td><td class="performance-strong">.762</td><td>.459</td><td>.068</td><td>.770</td>
<td class="column-border-left">.212</td><td>.709</td><td>.716</td>
<td class="column-border-left">.750</td><td>.815</td>
</tr>
<tr>
<td>Cohere Command R+</td>
<td class="column-border-left">.756</td><td>.333</td><td>.922</td><td>.021</td>
<td class="column-border-left">.452</td>
<td class="column-border-left performance-strong">.106</td><td>.533</td><td>.699</td>
<td class="column-border-left">.080</td><td>.238</td>
<td class="column-border-left">.651</td><td>.684</td><td>.393</td><td>.118</td><td>.812</td>
<td class="column-border-left">.259</td><td>.776</td><td>.698</td>
<td class="column-border-left">.751</td><td>.810</td>
</tr>
<tr>
<td>Google Gemini 1.5 Pro</td>
<td class="column-border-left">.712</td><td>.374</td><td class="performance-strong">.944</td><td>.019</td>
<td class="column-border-left">.393</td>
<td class="column-border-left">.144</td><td class="performance-strong">.593</td><td>.885</td>
<td class="column-border-left">.196</td><td>.217</td>
<td class="column-border-left">.418</td><td>.336</td><td>.579</td><td>.525</td><td class="performance-strong">.837</td>
<td class="column-border-left">.280</td><td>.829</td><td>.763</td>
<td class="column-border-left performance-best">.777</td><td class="performance-strong">.817</td>
</tr>
<tr>
<td>OpenAI gpt-4o</td>
<td class="column-border-left">.766</td><td>.399</td><td class="performance-strong">.942</td><td>.037</td>
<td class="column-border-left">.523</td>
<td class="column-border-left">.184</td><td>.541</td><td class="performance-strong">.928</td>
<td class="column-border-left">.130</td><td>.222</td>
<td class="column-border-left performance-strong">.710</td><td>.524</td><td class="performance-strong">.664</td><td class="performance-best">.750</td><td>.824</td>
<td class="column-border-left performance-strong">.749</td><td>.836</td><td>.754</td>
<td class="column-border-left performance-strong">.773</td><td>.816</td>
</tr>
<tr>
<td>OpenAI o1-mini</td>
<td class="column-border-left">.761</td><td class="performance-strong">.403</td><td>.876</td><td>.010</td>
<td class="column-border-left performance-best">.662</td>
<td class="column-border-left">.120</td><td>.542</td><td>.917</td>
<td class="column-border-left performance-strong">.289</td><td>.209</td>
<td class="column-border-left">.670</td><td>.612</td><td>.635</td><td class="performance-strong">.720</td><td>.769</td>
<td class="column-border-left performance-strong">.840</td><td>.799</td><td>.698</td>
<td class="column-border-left">.763</td><td>.816</td>
</tr>
</tbody>
</table>
</div>
</div>
<!-- Causal Analysis tab content -->
<div id="causal-analysis" class="tab-content">
<h2 class="title is-4">Causal Analysis Results</h2>
<div class="table-container">
<table class="table is-bordered is-striped is-narrow is-hoverable">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="4" class="has-text-centered tooltip-trigger" style="border-bottom: 2px solid #dbdbdb;" data-tooltip="FinCausal Causal Detection (CD): For text sections identified as causal, this task extracts the Cause and Effect spans, handling both unicausal and multicausal cases in financial texts.">Causal Detection</th>
<th colspan="4" class="has-text-centered tooltip-trigger" style="border-bottom: 2px solid #dbdbdb;" data-tooltip="FinCausal Causality Classification (CC): Determines if a given financial text section contains a causal relation, labeled as 1 if causal and 0 otherwise.">Causal Classification</th>
</tr>
<tr>
<th class="has-text-centered tooltip-trigger" data-tooltip="Accuracy: The proportion of correctly identified cause-effect relationships among all predictions.">Accuracy</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Precision: The proportion of correctly identified causal relationships among all predicted causal relationships.">Precision</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Recall: The proportion of correctly identified causal relationships among all actual causal relationships.">Recall</th>
<th class="has-text-centered tooltip-trigger column-border-right" data-tooltip="F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics.">F1</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Precision: The proportion of correctly classified causal statements among all statements predicted as causal.">Precision</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Recall: The proportion of correctly classified causal statements among all actual causal statements.">Recall</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics.">F1</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Accuracy: The proportion of correctly classified statements (both causal and non-causal) among all statements.">Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 3 70B Instruct</td>
<td>0.148</td>
<td>0.429</td>
<td>0.148</td>
<td class="column-border-right">0.142</td>
<td>0.241</td>
<td>0.329</td>
<td>0.192</td>
<td>0.198</td>
</tr>
<tr>
<td>Llama 3 8B Instruct</td>
<td>0.097</td>
<td>0.341</td>
<td>0.097</td>
<td class="column-border-right">0.049</td>
<td>0.232</td>
<td>0.241</td>
<td>0.234</td>
<td class="performance-strong">0.380</td>
</tr>
<tr>
<td>DBRX Instruct</td>
<td>0.078</td>
<td>0.521</td>
<td>0.078</td>
<td class="column-border-right">0.087</td>
<td>0.276</td>
<td>0.313</td>
<td>0.231</td>
<td>0.235</td>
</tr>
<tr>
<td>DeepSeek LLM (67B)</td>
<td>0.026</td>
<td>0.214</td>
<td>0.026</td>
<td class="column-border-right">0.025</td>
<td>0.141</td>
<td>0.328</td>
<td>0.193</td>
<td>0.221</td>
</tr>
<tr>
<td>Gemma 2 27B</td>
<td>0.115</td>
<td>0.510</td>
<td>0.115</td>
<td class="column-border-right">0.133</td>
<td>0.309</td>
<td>0.310</td>
<td>0.242</td>
<td>0.262</td>
</tr>
<tr>
<td>Gemma 2 9B</td>
<td>0.115</td>
<td>0.394</td>
<td>0.115</td>
<td class="column-border-right">0.105</td>
<td>0.275</td>
<td>0.294</td>
<td>0.207</td>
<td>0.258</td>
</tr>
<tr>
<td>Mistral (7B) Instruct v0.3</td>
<td>0.078</td>
<td>0.455</td>
<td>0.078</td>
<td class="column-border-right">0.052</td>
<td>0.339</td>
<td class="performance-best">0.361</td>
<td>0.227</td>
<td>0.258</td>
</tr>
<tr>
<td>Mixtral-8x22B Instruct</td>
<td>0.131</td>
<td>0.486</td>
<td>0.131</td>
<td class="column-border-right">0.125</td>
<td>0.344</td>
<td>0.310</td>
<td class="performance-best">0.308</td>
<td class="performance-strong">0.318</td>
</tr>
<tr>
<td>Mixtral-8x7B Instruct</td>
<td>0.088</td>
<td>0.510</td>
<td>0.088</td>
<td class="column-border-right">0.055</td>
<td>0.308</td>
<td>0.314</td>
<td>0.229</td>
<td>0.273</td>
</tr>
<tr>
<td>Qwen 2 Instruct (72B)</td>
<td>0.139</td>
<td>0.489</td>
<td>0.139</td>
<td class="column-border-right">0.190</td>
<td>0.208</td>
<td>0.330</td>
<td>0.184</td>
<td>0.188</td>
</tr>
<tr>
<td>WizardLM-2 8x22B</td>
<td>0.076</td>
<td>0.453</td>
<td>0.076</td>
<td class="column-border-right">0.114</td>
<td>0.263</td>
<td>0.347</td>
<td>0.201</td>
<td>0.237</td>
</tr>
<tr>
<td>DeepSeek-V3</td>
<td>0.164</td>
<td>0.528</td>
<td>0.164</td>
<td class="performance-strong column-border-right">0.198</td>
<td>0.194</td>
<td>0.327</td>
<td>0.170</td>
<td>0.248</td>
</tr>
<tr>
<td>DeepSeek R1</td>
<td class="performance-best">0.245</td>
<td class="performance-strong">0.643</td>
<td class="performance-best">0.245</td>
<td class="performance-best column-border-right">0.337</td>
<td class="performance-best">0.385</td>
<td>0.318</td>
<td>0.202</td>
<td>0.221</td>
</tr>
<tr>
<td>QwQ-32B-Preview</td>
<td>0.110</td>
<td>0.473</td>
<td>0.110</td>
<td class="column-border-right">0.131</td>
<td>0.193</td>
<td>0.262</td>
<td>0.220</td>
<td class="performance-best">0.465</td>
</tr>
<tr>
<td>Jamba 1.5 Mini</td>
<td>0.050</td>
<td>0.280</td>
<td>0.050</td>
<td class="column-border-right">0.043</td>
<td>0.323</td>
<td>0.283</td>
<td class="performance-strong">0.270</td>
<td>0.295</td>
</tr>
<tr>
<td>Jamba 1.5 Large</td>
<td>0.076</td>
<td>0.517</td>
<td>0.076</td>
<td class="column-border-right">0.074</td>
<td>0.268</td>
<td>0.248</td>
<td>0.176</td>
<td>0.200</td>
</tr>
<tr>
<td>Claude 3.5 Sonnet</td>
<td>0.154</td>
<td>0.564</td>
<td>0.154</td>
<td class="column-border-right">0.196</td>
<td>0.259</td>
<td>0.336</td>
<td>0.197</td>
<td>0.235</td>
</tr>
<tr>
<td>Claude 3 Haiku</td>
<td>0.082</td>
<td>0.388</td>
<td>0.082</td>
<td class="column-border-right">0.081</td>
<td class="performance-strong">0.369</td>
<td>0.347</td>
<td>0.200</td>
<td>0.203</td>
</tr>
<tr>
<td>Cohere Command R 7B</td>
<td>0.089</td>
<td>0.363</td>
<td>0.089</td>
<td class="column-border-right">0.057</td>
<td class="performance-strong">0.379</td>
<td class="performance-strong">0.356</td>
<td class="performance-strong">0.255</td>
<td>0.275</td>
</tr>
<tr>
<td>Cohere Command R +</td>
<td>0.090</td>
<td>0.453</td>
<td>0.090</td>
<td class="column-border-right">0.080</td>
<td>0.353</td>
<td>0.336</td>
<td>0.238</td>
<td>0.265</td>
</tr>
<tr>
<td>Google Gemini 1.5 Pro</td>
<td class="performance-strong">0.165</td>
<td>0.514</td>
<td class="performance-strong">0.165</td>
<td class="column-border-right">0.196</td>
<td>0.265</td>
<td class="performance-strong">0.357</td>
<td>0.217</td>
<td>0.258</td>
</tr>
<tr>
<td>OpenAI gpt-4o</td>
<td>0.082</td>
<td class="performance-strong">0.576</td>
<td>0.082</td>
<td class="column-border-right">0.130</td>
<td>0.254</td>
<td>0.327</td>
<td>0.222</td>
<td>0.235</td>
</tr>
<tr>
<td>OpenAI o1-mini</td>
<td class="performance-strong">0.206</td>
<td class="performance-best">0.648</td>
<td class="performance-strong">0.206</td>
<td class="performance-strong column-border-right">0.289</td>
<td>0.325</td>
<td>0.316</td>
<td>0.209</td>
<td>0.233</td>
</tr>
</tbody>
</table>
<div class="content is-small mt-4">
<p><strong>Note:</strong> Color highlighting indicates performance ranking:
<span class="performance-best">&nbsp;Best&nbsp;</span>,
<span class="performance-strong">&nbsp;Strong&nbsp;</span>
</p>
</div>
</div>
</div>
<!-- Information Retrieval -->
<div id="information-retrieval" class="tab-content">
<h2 class="title is-4">Information Retrieval Task Results</h2>
<div class="table-container">
<table class="table is-bordered is-striped is-narrow is-hoverable is-fullwidth">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="FiNER-Open Research Dataset: A manually annotated dataset for financial named entity recognition, containing 47,851 financial news articles with annotations for person, location, and organization entities.">FiNER</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="FinRED: A specialized relation extraction dataset for the financial domain, created from financial news and earnings call transcripts, with financial relations mapped using distance supervision based on Wikidata triplets.">FinRED</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="REFinD: A specialized relation extraction dataset created to address the unique challenges of extracting relationships between entity pairs from financial texts.">REFinD</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="Financial Numeric Extreme Labeling (FNXL): A dataset addressing the challenge of automating the annotation of numerals in financial statements with appropriate labels from a vast taxonomy.">FNXL</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="FinEntity: An entity-level sentiment classification dataset designed for financial news analysis containing 979 financial news paragraphs with 2,131 manually-annotated financial entities.">FinEntity</th>
</tr>
<tr>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered">F1</th>
<th class="has-text-centered column-border-right">Accuracy</th>
<th class="has-text-centered">Accuracy</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered column-border-right">F1</th>
<th class="has-text-centered">Accuracy</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered column-border-right">F1</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered">F1</th>
<th class="has-text-centered column-border-right">Accuracy</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered">Accuracy</th>
<th class="has-text-centered">F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 3 70B Instruct</td>
<td>0.715</td><td>0.693</td><td>0.701</td><td class="column-border-right">0.911</td>
<td>0.314</td><td class="performance-strong">0.454</td><td>0.314</td><td class="column-border-right">0.332</td>
<td>0.879</td><td>0.904</td><td>0.879</td><td class="column-border-right">0.883</td>
<td>0.015</td><td>0.030</td><td>0.020</td><td class="column-border-right">0.010</td>
<td>0.474</td><td>0.485</td><td>0.485</td><td>0.469</td>
</tr>
<tr>
<td>Llama 3 8B Instruct</td>
<td>0.581</td><td>0.558</td><td>0.565</td><td class="column-border-right">0.854</td>
<td>0.296</td><td>0.357</td><td>0.296</td><td class="column-border-right">0.289</td>
<td>0.723</td><td>0.755</td><td>0.723</td><td class="column-border-right">0.705</td>
<td>0.003</td><td>0.004</td><td>0.003</td><td class="column-border-right">0.002</td>
<td>0.301</td><td>0.478</td><td>0.478</td><td>0.350</td>
</tr>
<tr>
<td>DBRX Instruct</td>
<td>0.516</td><td>0.476</td><td>0.489</td><td class="column-border-right">0.802</td>
<td>0.329</td><td>0.371</td><td>0.329</td><td class="column-border-right">0.304</td>
<td>0.766</td><td>0.825</td><td>0.766</td><td class="column-border-right">0.778</td>
<td>0.008</td><td>0.011</td><td>0.009</td><td class="column-border-right">0.005</td>
<td>0.004</td><td>0.014</td><td>0.014</td><td>0.006</td>
</tr>
<tr>
<td>DeepSeek LLM (67B)</td>
<td>0.752</td><td>0.742</td><td>0.745</td><td class="column-border-right">0.917</td>
<td>0.344</td><td>0.403</td><td>0.344</td><td class="column-border-right">0.334</td>
<td>0.874</td><td>0.890</td><td>0.874</td><td class="column-border-right">0.879</td>
<td>0.005</td><td>0.009</td><td>0.007</td><td class="column-border-right">0.003</td>
<td>0.456</td><td>0.405</td><td>0.405</td><td>0.416</td>
</tr>
<tr>
<td>Gemma 2 27B</td>
<td>0.772</td><td>0.754</td><td>0.761</td><td class="column-border-right performance-strong">0.923</td>
<td>0.352</td><td>0.437</td><td>0.352</td><td class="column-border-right">0.356</td>
<td>0.897</td><td>0.914</td><td>0.897</td><td class="column-border-right">0.902</td>
<td>0.005</td><td>0.008</td><td>0.006</td><td class="column-border-right">0.003</td>
<td>0.320</td><td>0.295</td><td>0.295</td><td>0.298</td>
</tr>
<tr>
<td>Gemma 2 9B</td>
<td>0.665</td><td>0.643</td><td>0.651</td><td class="column-border-right">0.886</td>
<td>0.336</td><td>0.373</td><td>0.336</td><td class="column-border-right">0.331</td>
<td>0.885</td><td>0.902</td><td>0.885</td><td class="column-border-right">0.892</td>
<td>0.004</td><td>0.008</td><td>0.005</td><td class="column-border-right">0.003</td>
<td>0.348</td><td>0.419</td><td>0.419</td><td>0.367</td>
</tr>
<tr>
<td>Mistral (7B) Instruct v0.3</td>
<td>0.540</td><td>0.522</td><td>0.526</td><td class="column-border-right">0.806</td>
<td>0.278</td><td>0.383</td><td>0.278</td><td class="column-border-right">0.276</td>
<td>0.767</td><td>0.817</td><td>0.767</td><td class="column-border-right">0.771</td>
<td>0.004</td><td>0.006</td><td>0.004</td><td class="column-border-right">0.002</td>
<td>0.337</td><td>0.477</td><td>0.477</td><td>0.368</td>
</tr>
<tr>
<td>Mixtral-8x22B Instruct</td>
<td>0.653</td><td>0.625</td><td>0.635</td><td class="column-border-right">0.870</td>
<td>0.381</td><td>0.414</td><td>0.381</td><td class="column-border-right">0.367</td>
<td>0.807</td><td>0.847</td><td>0.807</td><td class="column-border-right">0.811</td>
<td>0.010</td><td>0.008</td><td>0.009</td><td class="column-border-right">0.005</td>
<td>0.428</td><td>0.481</td><td>0.481</td><td>0.435</td>
</tr>
<tr>
<td>Mixtral-8x7B Instruct</td>
<td>0.613</td><td>0.591</td><td>0.598</td><td class="column-border-right">0.875</td>
<td>0.291</td><td>0.376</td><td>0.291</td><td class="column-border-right">0.282</td>
<td>0.840</td><td>0.863</td><td>0.840</td><td class="column-border-right">0.845</td>
<td>0.007</td><td>0.012</td><td>0.009</td><td class="column-border-right">0.005</td>
<td>0.251</td><td>0.324</td><td>0.324</td><td>0.267</td>
</tr>
<tr>
<td>Qwen 2 Instruct (72B)</td>
<td>0.766</td><td>0.742</td><td>0.748</td><td class="column-border-right">0.899</td>
<td>0.365</td><td>0.407</td><td>0.365</td><td class="column-border-right">0.348</td>
<td>0.850</td><td>0.881</td><td>0.850</td><td class="column-border-right">0.854</td>
<td>0.010</td><td>0.016</td><td>0.012</td><td class="column-border-right">0.006</td>
<td>0.468</td><td>0.530</td><td>0.530</td><td>0.483</td>
</tr>
<tr>
<td>WizardLM-2 8x22B</td>
<td>0.755</td><td>0.741</td><td>0.744</td><td class="column-border-right">0.920</td>
<td>0.362</td><td>0.397</td><td>0.362</td><td class="column-border-right">0.355</td>
<td>0.846</td><td>0.874</td><td>0.846</td><td class="column-border-right">0.852</td>
<td>0.008</td><td>0.009</td><td>0.008</td><td class="column-border-right">0.004</td>
<td>0.222</td><td>0.247</td><td>0.247</td><td>0.226</td>
</tr>
<tr>
<td>DeepSeek-V3</td>
<td class="performance-strong">0.798</td><td class="performance-strong">0.787</td><td class="performance-strong">0.790</td><td class="column-border-right performance-best">0.945</td>
<td class="performance-best">0.450</td><td class="performance-best">0.463</td><td class="performance-best">0.450</td><td class="column-border-right performance-best">0.437</td>
<td>0.927</td><td class="performance-strong">0.943</td><td>0.927</td><td class="column-border-right">0.934</td>
<td class="performance-best">0.034</td><td class="performance-strong">0.067</td><td class="performance-strong">0.045</td><td class="column-border-right performance-strong">0.023</td>
<td>0.563</td><td>0.544</td><td>0.544</td><td>0.549</td>
</tr>
<tr>
<td>DeepSeek R1</td>
<td class="performance-best">0.813</td><td class="performance-best">0.805</td><td class="performance-best">0.807</td><td class="column-border-right performance-best">0.944</td>
<td class="performance-strong">0.412</td><td>0.424</td><td class="performance-strong">0.412</td><td class="column-border-right">0.393</td>
<td class="performance-best">0.946</td><td class="performance-best">0.960</td><td class="performance-best">0.946</td><td class="column-border-right performance-best">0.952</td>
<td class="performance-best">0.044</td><td class="performance-best">0.082</td><td class="performance-best">0.057</td><td class="column-border-right performance-best">0.029</td>
<td class="performance-strong">0.600</td><td class="performance-strong">0.586</td><td class="performance-strong">0.586</td><td class="performance-strong">0.587</td>
</tr>
<tr>
<td>QwQ-32B-Preview</td>
<td>0.695</td><td>0.681</td><td>0.685</td><td class="column-border-right">0.907</td>
<td>0.278</td><td>0.396</td><td>0.278</td><td class="column-border-right">0.270</td>
<td>0.680</td><td>0.770</td><td>0.680</td><td class="column-border-right">0.656</td>
<td>0.001</td><td>0.001</td><td>0.001</td><td class="column-border-right">0.000</td>
<td>0.005</td><td>0.005</td><td>0.005</td><td>0.005</td>
</tr>
<tr>
<td>Jamba 1.5 Mini</td>
<td>0.564</td><td>0.556</td><td>0.552</td><td class="column-border-right">0.818</td>
<td>0.308</td><td>0.450</td><td>0.308</td><td class="column-border-right">0.284</td>
<td>0.830</td><td>0.864</td><td>0.830</td><td class="column-border-right">0.844</td>
<td>0.004</td><td>0.006</td><td>0.005</td><td class="column-border-right">0.003</td>
<td>0.119</td><td>0.182</td><td>0.182</td><td>0.132</td>
</tr>
<tr>
<td>Jamba 1.5 Large</td>
<td>0.707</td><td>0.687</td><td>0.693</td><td class="column-border-right">0.883</td>
<td>0.341</td><td>0.452</td><td>0.341</td><td class="column-border-right">0.341</td>
<td>0.856</td><td>0.890</td><td>0.856</td><td class="column-border-right">0.862</td>
<td>0.004</td><td>0.005</td><td>0.005</td><td class="column-border-right">0.002</td>
<td>0.403</td><td>0.414</td><td>0.414</td><td>0.397</td>
</tr>
<tr>
<td>Claude 3.5 Sonnet</td>
<td class="performance-best">0.811</td><td class="performance-best">0.794</td><td class="performance-best">0.799</td><td class="column-border-right">0.922</td>
<td class="performance-best">0.455</td><td class="performance-best">0.465</td><td class="performance-best">0.455</td><td class="column-border-right performance-best">0.439</td>
<td>0.873</td><td>0.927</td><td>0.873</td><td class="column-border-right">0.891</td>
<td class="performance-best">0.034</td><td class="performance-best">0.080</td><td class="performance-best">0.047</td><td class="column-border-right performance-best">0.024</td>
<td class="performance-best">0.658</td><td class="performance-best">0.668</td><td class="performance-best">0.668</td><td class="performance-best">0.655</td>
</tr>
<tr>
<td>Claude 3 Haiku</td>
<td>0.732</td><td>0.700</td><td>0.711</td><td class="column-border-right">0.895</td>
<td>0.294</td><td>0.330</td><td>0.294</td><td class="column-border-right">0.285</td>
<td>0.879</td><td>0.917</td><td>0.879</td><td class="column-border-right">0.883</td>
<td>0.011</td><td>0.022</td><td>0.015</td><td class="column-border-right">0.008</td>
<td>0.498</td><td>0.517</td><td>0.517</td><td>0.494</td>
</tr>
<tr>
<td>Cohere Command R+</td>
<td>0.769</td><td>0.750</td><td>0.756</td><td class="column-border-right">0.902</td>
<td>0.353</td><td>0.405</td><td>0.353</td><td class="column-border-right">0.333</td>
<td>0.917</td><td>0.930</td><td>0.917</td><td class="column-border-right">0.922</td>
<td>0.016</td><td>0.032</td><td>0.021</td><td class="column-border-right">0.011</td>
<td>0.462</td><td>0.459</td><td>0.459</td><td>0.452</td>
</tr>
<tr>
<td>Google Gemini 1.5 Pro</td>
<td>0.728</td><td>0.705</td><td>0.712</td><td class="column-border-right">0.891</td>
<td>0.373</td><td>0.436</td><td>0.373</td><td class="column-border-right">0.374</td>
<td class="performance-strong">0.934</td><td class="performance-best">0.955</td><td class="performance-strong">0.934</td><td class="column-border-right performance-strong">0.944</td>
<td>0.014</td><td>0.028</td><td>0.019</td><td class="column-border-right">0.010</td>
<td>0.399</td><td>0.400</td><td>0.400</td><td>0.393</td>
</tr>
<tr>
<td>OpenAI gpt-4o</td>
<td>0.778</td><td>0.760</td><td>0.766</td><td class="column-border-right">0.911</td>
<td>0.402</td><td>0.445</td><td>0.402</td><td class="column-border-right">0.399</td>
<td class="performance-strong">0.931</td><td class="performance-best">0.955</td><td class="performance-strong">0.931</td><td class="column-border-right performance-strong">0.942</td>
<td class="performance-strong">0.027</td><td>0.056</td><td>0.037</td><td class="column-border-right">0.019</td>
<td>0.537</td><td>0.517</td><td>0.517</td><td>0.523</td>
</tr>
<tr>
<td>OpenAI o1-mini</td>
<td>0.772</td><td>0.755</td><td>0.761</td><td class="column-border-right">0.922</td>
<td>0.407</td><td>0.444</td><td>0.407</td><td class="column-border-right performance-strong">0.403</td>
<td>0.867</td><td>0.900</td><td>0.867</td><td class="column-border-right">0.876</td>
<td>0.007</td><td>0.015</td><td>0.010</td><td class="column-border-right">0.005</td>
<td class="performance-best">0.661</td><td class="performance-best">0.681</td><td class="performance-best">0.681</td><td class="performance-best">0.662</td>
</tr>
</tbody>
</table>
<div class="content is-small mt-4">
<p><strong>Note:</strong> Color highlighting indicates performance ranking:
<span class="performance-best">&nbsp;Best&nbsp;</span>,
<span class="performance-strong">&nbsp;Strong&nbsp;</span>
</p>
</div>
</div>
</div>
<!-- Question Answering -->
<div id="question-answering" class="tab-content">
<h2 class="title is-4">Question Answering Task Results</h2>
<div class="results-table">
<table class="table is-bordered is-striped is-narrow is-hoverable is-fullwidth">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="3" class="has-text-centered">Datasets (Accuracy)</th>
</tr>
<tr>
<th class="has-text-centered tooltip-trigger" data-tooltip="Large-scale dataset for numerical reasoning over financial data, consisting of 8,281 question-answer pairs from financial reports. Focuses on questions requiring interpretation of financial data and multi-step reasoning. Licensed under CC BY-NC 4.0.">FinQA</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Multi-turn question answering dataset with 3,892 conversations and 14,115 questions exploring chains of numerical reasoning in financial dialogues. Released under MIT License.">ConvFinQA</th>
<th class="has-text-centered tooltip-trigger" data-tooltip="Large-scale QA dataset for hybrid data sources (tables and text) from financial reports, emphasizing numerical reasoning operations. Licensed under CC BY 4.0.">TATQA</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 3 70B Instruct</td>
<td class="has-text-centered">0.809</td>
<td class="has-text-centered">0.709</td>
<td class="has-text-centered">0.772</td>
</tr>
<tr>
<td>Llama 3 8B Instruct</td>
<td class="has-text-centered">0.767</td>
<td class="has-text-centered">0.268</td>
<td class="has-text-centered">0.706</td>
</tr>
<tr>
<td>DBRX Instruct</td>
<td class="has-text-centered">0.738</td>
<td class="has-text-centered">0.252</td>
<td class="has-text-centered">0.633</td>
</tr>
<tr>
<td>DeepSeek LLM (67B)</td>
<td class="has-text-centered">0.742</td>
<td class="has-text-centered">0.174</td>
<td class="has-text-centered">0.355</td>
</tr>
<tr>
<td>Gemma 2 27B</td>
<td class="has-text-centered">0.768</td>
<td class="has-text-centered">0.268</td>
<td class="has-text-centered">0.734</td>
</tr>
<tr>
<td>Gemma 2 9B</td>
<td class="has-text-centered">0.779</td>
<td class="has-text-centered">0.292</td>
<td class="has-text-centered">0.750</td>
</tr>
<tr>
<td>Mistral (7B) Instruct v0.3</td>
<td class="has-text-centered">0.655</td>
<td class="has-text-centered">0.199</td>
<td class="has-text-centered">0.553</td>
</tr>
<tr>
<td>Mixtral-8x22B Instruct</td>
<td class="has-text-centered">0.766</td>
<td class="has-text-centered">0.285</td>
<td class="has-text-centered">0.666</td>
</tr>
<tr>
<td>Mixtral-8x7B Instruct</td>
<td class="has-text-centered">0.611</td>
<td class="has-text-centered">0.315</td>
<td class="has-text-centered">0.501</td>
</tr>
<tr>
<td>Qwen 2 Instruct (72B)</td>
<td class="has-text-centered">0.819</td>
<td class="has-text-centered">0.269</td>
<td class="has-text-centered">0.715</td>
</tr>
<tr>
<td>WizardLM-2 8x22B</td>
<td class="has-text-centered">0.796</td>
<td class="has-text-centered">0.247</td>
<td class="has-text-centered">0.725</td>
</tr>
<tr>
<td>DeepSeek-V3</td>
<td class="has-text-centered performance-medium">0.840</td>
<td class="has-text-centered">0.261</td>
<td class="has-text-centered performance-low">0.779</td>
</tr>
<tr>
<td>DeepSeek R1</td>
<td class="has-text-centered performance-low">0.836</td>
<td class="has-text-centered performance-best">0.853</td>
<td class="has-text-centered performance-best">0.858</td>
</tr>
<tr>
<td>QwQ-32B-Preview</td>
<td class="has-text-centered">0.793</td>
<td class="has-text-centered">0.282</td>
<td class="has-text-centered performance-medium">0.796</td>
</tr>
<tr>
<td>Jamba 1.5 Mini</td>
<td class="has-text-centered">0.666</td>
<td class="has-text-centered">0.218</td>
<td class="has-text-centered">0.586</td>
</tr>
<tr>
<td>Jamba 1.5 Large</td>
<td class="has-text-centered">0.790</td>
<td class="has-text-centered">0.225</td>
<td class="has-text-centered">0.660</td>
</tr>
<tr>
<td>Claude 3.5 Sonnet</td>
<td class="has-text-centered performance-best">0.844</td>
<td class="has-text-centered">0.402</td>
<td class="has-text-centered">0.700</td>
</tr>
<tr>
<td>Claude 3 Haiku</td>
<td class="has-text-centered">0.803</td>
<td class="has-text-centered">0.421</td>
<td class="has-text-centered">0.733</td>
</tr>
<tr>
<td>Cohere Command R 7B</td>
<td class="has-text-centered">0.709</td>
<td class="has-text-centered">0.212</td>
<td class="has-text-centered">0.716</td>
</tr>
<tr>
<td>Cohere Command R+</td>
<td class="has-text-centered">0.776</td>
<td class="has-text-centered">0.259</td>
<td class="has-text-centered">0.698</td>
</tr>
<tr>
<td>Google Gemini 1.5 Pro</td>
<td class="has-text-centered">0.829</td>
<td class="has-text-centered">0.280</td>
<td class="has-text-centered">0.763</td>
</tr>
<tr>
<td>OpenAI gpt-4o</td>
<td class="has-text-centered performance-low">0.836</td>
<td class="has-text-centered performance-low">0.749</td>
<td class="has-text-centered">0.754</td>
</tr>
<tr>
<td>OpenAI o1-mini</td>
<td class="has-text-centered">0.799</td>
<td class="has-text-centered performance-medium">0.840</td>
<td class="has-text-centered">0.698</td>
</tr>
</tbody>
</table>
<div class="content is-small mt-4">
<p><strong>Note:</strong> Color highlighting indicates performance ranking:
<span class="performance-best">&nbsp;Best&nbsp;</span>,
<span class="performance-medium">&nbsp;Strong&nbsp;</span>,
<span class="performance-low">&nbsp;Good&nbsp;</span>
</p>
</div>
</div>
</div><!-- Sentiment Analysis -->
<div id="sentiment-analysis" class="tab-content">
<h2 class="title is-4">Sentiment Analysis Task Results</h2>
<div class="results-table">
<table class="table is-bordered is-striped is-narrow is-hoverable is-fullwidth">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="3" class="has-text-centered tooltip-trigger" data-tooltip="FiQA Task 1 focuses on aspect-based financial sentiment analysis. Given a financial text, such as microblog posts or news headlines, systems predict sentiment scores on a continuous scale from -1 (negative) to 1 (positive). Evaluation metrics include MSE, MAE, and R-squared.">FiQA Task 1</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="Financial Phrase Bank contains 4,840 sentences from English-language financial news articles, categorized as positive, negative, or neutral. Each sentence reflects the sentiment an investor might perceive regarding its influence on stock prices. Annotated by 16 finance experts using majority voting.">Financial Phrase Bank (FPB)</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="Manually-annotated dataset focusing on subjectivity in Earnings Call Transcripts QA sessions. Includes 49,446 annotations across 2,747 QA pairs labeled on six subjectivity features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant.">SubjECTive-QA</th>
</tr>
<tr>
<th class="has-text-centered">MSE</th>
<th class="has-text-centered">MAE</th>
<th class="has-text-centered">R² Score</th>
<th class="has-text-centered">Accuracy</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered">F1</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered">F1</th>
<th class="has-text-centered">Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 3 70B Instruct</td>
<td class="has-text-centered">0.123</td>
<td class="has-text-centered">0.290</td>
<td class="has-text-centered">0.272</td>
<td class="has-text-centered">0.901</td>
<td class="has-text-centered">0.904</td>
<td class="has-text-centered">0.901</td>
<td class="has-text-centered">0.902</td>
<td class="has-text-centered">0.652</td>
<td class="has-text-centered">0.573</td>
<td class="has-text-centered">0.535</td>
<td class="has-text-centered">0.573</td>
</tr>
<tr>
<td>Llama 3 8B Instruct</td>
<td class="has-text-centered">0.161</td>
<td class="has-text-centered">0.344</td>
<td class="has-text-centered">0.045</td>
<td class="has-text-centered">0.738</td>
<td class="has-text-centered">0.801</td>
<td class="has-text-centered">0.738</td>
<td class="has-text-centered">0.698</td>
<td class="has-text-centered">0.635</td>
<td class="has-text-centered">0.625</td>
<td class="has-text-centered performance-best">0.600</td>
<td class="has-text-centered">0.625</td>
</tr>
<tr>
<td>DBRX Instruct</td>
<td class="has-text-centered">0.160</td>
<td class="has-text-centered">0.321</td>
<td class="has-text-centered">0.052</td>
<td class="has-text-centered">0.524</td>
<td class="has-text-centered">0.727</td>
<td class="has-text-centered">0.524</td>
<td class="has-text-centered">0.499</td>
<td class="has-text-centered">0.654</td>
<td class="has-text-centered">0.541</td>
<td class="has-text-centered">0.436</td>
<td class="has-text-centered">0.541</td>
</tr>
<tr>
<td>DeepSeek LLM (67B)</td>
<td class="has-text-centered">0.118</td>
<td class="has-text-centered">0.278</td>
<td class="has-text-centered">0.302</td>
<td class="has-text-centered">0.815</td>
<td class="has-text-centered">0.867</td>
<td class="has-text-centered">0.815</td>
<td class="has-text-centered">0.811</td>
<td class="has-text-centered">0.676</td>
<td class="has-text-centered">0.544</td>
<td class="has-text-centered">0.462</td>
<td class="has-text-centered">0.544</td>
</tr>
<tr>
<td>Gemma 2 27B</td>
<td class="has-text-centered performance-best">0.100</td>
<td class="has-text-centered performance-best">0.266</td>
<td class="has-text-centered">0.406</td>
<td class="has-text-centered">0.890</td>
<td class="has-text-centered">0.896</td>
<td class="has-text-centered">0.890</td>
<td class="has-text-centered">0.884</td>
<td class="has-text-centered">0.562</td>
<td class="has-text-centered">0.524</td>
<td class="has-text-centered">0.515</td>
<td class="has-text-centered">0.524</td>
</tr>
<tr>
<td>Gemma 2 9B</td>
<td class="has-text-centered">0.189</td>
<td class="has-text-centered">0.352</td>
<td class="has-text-centered">-0.120</td>
<td class="has-text-centered performance-strong">0.940</td>
<td class="has-text-centered performance-strong">0.941</td>
<td class="has-text-centered performance-strong">0.940</td>
<td class="has-text-centered performance-strong">0.940</td>
<td class="has-text-centered">0.570</td>
<td class="has-text-centered">0.499</td>
<td class="has-text-centered">0.491</td>
<td class="has-text-centered">0.499</td>
</tr>
<tr>
<td>Mistral (7B) Instruct v0.3</td>
<td class="has-text-centered">0.135</td>
<td class="has-text-centered">0.278</td>
<td class="has-text-centered">0.200</td>
<td class="has-text-centered">0.847</td>
<td class="has-text-centered">0.854</td>
<td class="has-text-centered">0.847</td>
<td class="has-text-centered">0.841</td>
<td class="has-text-centered">0.607</td>
<td class="has-text-centered">0.542</td>
<td class="has-text-centered">0.522</td>
<td class="has-text-centered">0.542</td>
</tr>
<tr>
<td>Mixtral-8x22B Instruct</td>
<td class="has-text-centered">0.221</td>
<td class="has-text-centered">0.364</td>
<td class="has-text-centered">-0.310</td>
<td class="has-text-centered">0.768</td>
<td class="has-text-centered">0.845</td>
<td class="has-text-centered">0.768</td>
<td class="has-text-centered">0.776</td>
<td class="has-text-centered">0.614</td>
<td class="has-text-centered">0.538</td>
<td class="has-text-centered">0.510</td>
<td class="has-text-centered">0.538</td>
</tr>
<tr>
<td>Mixtral-8x7B Instruct</td>
<td class="has-text-centered">0.208</td>
<td class="has-text-centered">0.307</td>
<td class="has-text-centered">-0.229</td>
<td class="has-text-centered">0.896</td>
<td class="has-text-centered">0.898</td>
<td class="has-text-centered">0.896</td>
<td class="has-text-centered">0.893</td>
<td class="has-text-centered">0.611</td>
<td class="has-text-centered">0.518</td>
<td class="has-text-centered">0.498</td>
<td class="has-text-centered">0.518</td>
</tr>
<tr>
<td>Qwen 2 Instruct (72B)</td>
<td class="has-text-centered">0.205</td>
<td class="has-text-centered">0.409</td>
<td class="has-text-centered">-0.212</td>
<td class="has-text-centered">0.904</td>
<td class="has-text-centered">0.908</td>
<td class="has-text-centered">0.904</td>
<td class="has-text-centered">0.901</td>
<td class="has-text-centered">0.644</td>
<td class="has-text-centered">0.601</td>
<td class="has-text-centered">0.576</td>
<td class="has-text-centered">0.601</td>
</tr>
<tr>
<td>WizardLM-2 8x22B</td>
<td class="has-text-centered">0.129</td>
<td class="has-text-centered">0.283</td>
<td class="has-text-centered">0.239</td>
<td class="has-text-centered">0.765</td>
<td class="has-text-centered">0.853</td>
<td class="has-text-centered">0.765</td>
<td class="has-text-centered">0.779</td>
<td class="has-text-centered">0.611</td>
<td class="has-text-centered">0.570</td>
<td class="has-text-centered">0.566</td>
<td class="has-text-centered">0.570</td>
</tr>
<tr>
<td>DeepSeek-V3</td>
<td class="has-text-centered">0.150</td>
<td class="has-text-centered">0.311</td>
<td class="has-text-centered">0.111</td>
<td class="has-text-centered">0.828</td>
<td class="has-text-centered">0.851</td>
<td class="has-text-centered">0.828</td>
<td class="has-text-centered">0.814</td>
<td class="has-text-centered">0.640</td>
<td class="has-text-centered">0.572</td>
<td class="has-text-centered performance-medium">0.583</td>
<td class="has-text-centered">0.572</td>
</tr>
<tr>
<td>DeepSeek R1</td>
<td class="has-text-centered performance-low">0.110</td>
<td class="has-text-centered">0.289</td>
<td class="has-text-centered">0.348</td>
<td class="has-text-centered">0.904</td>
<td class="has-text-centered">0.907</td>
<td class="has-text-centered">0.904</td>
<td class="has-text-centered">0.902</td>
<td class="has-text-centered">0.644</td>
<td class="has-text-centered">0.489</td>
<td class="has-text-centered">0.499</td>
<td class="has-text-centered">0.489</td>
</tr>
<tr>
<td>QwQ-32B-Preview</td>
<td class="has-text-centered">0.141</td>
<td class="has-text-centered">0.290</td>
<td class="has-text-centered">0.165</td>
<td class="has-text-centered">0.812</td>
<td class="has-text-centered">0.827</td>
<td class="has-text-centered">0.812</td>
<td class="has-text-centered">0.815</td>
<td class="has-text-centered">0.629</td>
<td class="has-text-centered">0.534</td>
<td class="has-text-centered">0.550</td>
<td class="has-text-centered">0.534</td>
</tr>
<tr>
<td>Jamba 1.5 Mini</td>
<td class="has-text-centered performance-low">0.119</td>
<td class="has-text-centered">0.282</td>
<td class="has-text-centered">0.293</td>
<td class="has-text-centered">0.784</td>
<td class="has-text-centered">0.814</td>
<td class="has-text-centered">0.784</td>
<td class="has-text-centered">0.765</td>
<td class="has-text-centered">0.380</td>
<td class="has-text-centered">0.525</td>
<td class="has-text-centered">0.418</td>
<td class="has-text-centered">0.525</td>
</tr>
<tr>
<td>Jamba 1.5 Large</td>
<td class="has-text-centered">0.183</td>
<td class="has-text-centered">0.363</td>
<td class="has-text-centered">-0.085</td>
<td class="has-text-centered">0.824</td>
<td class="has-text-centered">0.850</td>
<td class="has-text-centered">0.824</td>
<td class="has-text-centered">0.798</td>
<td class="has-text-centered">0.635</td>
<td class="has-text-centered">0.573</td>
<td class="has-text-centered performance-medium">0.582</td>
<td class="has-text-centered">0.573</td>
</tr>
<tr>
<td>Claude 3.5 Sonnet</td>
<td class="has-text-centered performance-low">0.101</td>
<td class="has-text-centered performance-low">0.268</td>
<td class="has-text-centered performance-best">0.402</td>
<td class="has-text-centered performance-best">0.944</td>
<td class="has-text-centered performance-best">0.945</td>
<td class="has-text-centered performance-best">0.944</td>
<td class="has-text-centered performance-best">0.944</td>
<td class="has-text-centered">0.634</td>
<td class="has-text-centered performance-medium">0.585</td>
<td class="has-text-centered">0.553</td>
<td class="has-text-centered performance-medium">0.585</td>
</tr>
<tr>
<td>Claude 3 Haiku</td>
<td class="has-text-centered">0.167</td>
<td class="has-text-centered">0.349</td>
<td class="has-text-centered">0.008</td>
<td class="has-text-centered">0.907</td>
<td class="has-text-centered">0.913</td>
<td class="has-text-centered">0.907</td>
<td class="has-text-centered">0.908</td>
<td class="has-text-centered">0.619</td>
<td class="has-text-centered">0.538</td>
<td class="has-text-centered">0.463</td>
<td class="has-text-centered">0.538</td>
</tr>
<tr>
<td>Cohere Command R 7B</td>
<td class="has-text-centered">0.164</td>
<td class="has-text-centered">0.319</td>
<td class="has-text-centered">0.028</td>
<td class="has-text-centered">0.835</td>
<td class="has-text-centered">0.861</td>
<td class="has-text-centered">0.835</td>
<td class="has-text-centered">0.840</td>
<td class="has-text-centered">0.609</td>
<td class="has-text-centered">0.547</td>
<td class="has-text-centered">0.532</td>
<td class="has-text-centered">0.547</td>
</tr>
<tr>
<td>Cohere Command R+</td>
<td class="has-text-centered performance-medium">0.106</td>
<td class="has-text-centered">0.274</td>
<td class="has-text-centered performance-medium">0.373</td>
<td class="has-text-centered">0.741</td>
<td class="has-text-centered">0.806</td>
<td class="has-text-centered">0.741</td>
<td class="has-text-centered">0.699</td>
<td class="has-text-centered">0.608</td>
<td class="has-text-centered">0.547</td>
<td class="has-text-centered">0.533</td>
<td class="has-text-centered">0.547</td>
</tr>
<tr>
<td>Google Gemini 1.5 Pro</td>
<td class="has-text-centered">0.144</td>
<td class="has-text-centered">0.329</td>
<td class="has-text-centered">0.149</td>
<td class="has-text-centered">0.890</td>
<td class="has-text-centered">0.895</td>
<td class="has-text-centered">0.890</td>
<td class="has-text-centered">0.885</td>
<td class="has-text-centered">0.642</td>
<td class="has-text-centered performance-medium">0.587</td>
<td class="has-text-centered performance-best">0.593</td>
<td class="has-text-centered performance-best">0.587</td>
</tr>
<tr>
<td>OpenAI gpt-4o</td>
<td class="has-text-centered">0.184</td>
<td class="has-text-centered">0.317</td>
<td class="has-text-centered">-0.089</td>
<td class="has-text-centered">0.929</td>
<td class="has-text-centered">0.931</td>
<td class="has-text-centered">0.929</td>
<td class="has-text-centered">0.928</td>
<td class="has-text-centered">0.639</td>
<td class="has-text-centered">0.515</td>
<td class="has-text-centered">0.541</td>
<td class="has-text-centered">0.515</td>
</tr>
<tr>
<td>OpenAI o1-mini</td>
<td class="has-text-centered performance-medium">0.120</td>
<td class="has-text-centered">0.295</td>
<td class="has-text-centered">0.289</td>
<td class="has-text-centered">0.918</td>
<td class="has-text-centered">0.917</td>
<td class="has-text-centered">0.918</td>
<td class="has-text-centered">0.917</td>
<td class="has-text-centered performance-best">0.660</td>
<td class="has-text-centered">0.515</td>
<td class="has-text-centered">0.542</td>
<td class="has-text-centered">0.515</td>
</tr>
</tbody>
</table>
<div class="content is-small mt-4">
<p><strong>Note:</strong> Color highlighting indicates performance ranking:
<span class="performance-best">&nbsp;Best&nbsp;</span>,
<span class="performance-medium">&nbsp;Strong&nbsp;</span>,
<span class="performance-low">&nbsp;Good&nbsp;</span>
</p>
</div>
</div>
</div><!-- Text Classification -->
<div id="text-classification" class="tab-content">
<h2 class="title is-4">Text Classification Task Results</h2>
<div class="results-table">
<table class="table is-bordered is-striped is-narrow is-hoverable is-fullwidth">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="A fine-grained dataset designed for intent detection within the banking domain, comprising 13,083 customer service queries annotated with 77 unique intents.">Banking77</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="A dataset designed to evaluate machine learning models using tabular data and profile text inputs for financial risk prediction, covering default, fraud, and churn with 333,000 labeled instances.">FinBench</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="A dataset of Federal Open Market Committee speeches, meeting minutes, and press conference transcripts (1996-2022) for hawkish-dovish classification of monetary policy stance.">FOMC</th>
<th colspan="4" class="has-text-centered tooltip-trigger" data-tooltip="An expert-annotated dataset for detecting fine-grained investor claims within financial narratives, focusing on numerals in analyst reports and earnings call transcripts.">NumClaim</th>
<th colspan="1" class="has-text-centered tooltip-trigger" data-tooltip="A dataset of 11,412 human-annotated financial news headlines focused on commodities (particularly gold), spanning 2000-2019, with binary indicators for price mentions and movements.">Headlines</th>
</tr>
<tr>
<th class="has-text-centered">Accuracy</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered">F1</th>
<th class="has-text-centered">Accuracy</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered">F1</th>
<th class="has-text-centered">Accuracy</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered">F1</th>
<th class="has-text-centered">Accuracy</th>
<th class="has-text-centered">Precision</th>
<th class="has-text-centered">Recall</th>
<th class="has-text-centered">F1</th>
<th class="has-text-centered">Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 3 70B Instruct</td>
<td class="has-text-centered">0.660</td>
<td class="has-text-centered">0.748</td>
<td class="has-text-centered">0.660</td>
<td class="has-text-centered">0.645</td>
<td class="has-text-centered">0.222</td>
<td class="has-text-centered">0.826</td>
<td class="has-text-centered">0.222</td>
<td class="has-text-centered">0.309</td>
<td class="has-text-centered">0.661</td>
<td class="has-text-centered">0.662</td>
<td class="has-text-centered">0.661</td>
<td class="has-text-centered">0.652</td>
<td class="has-text-centered">0.430</td>
<td class="has-text-centered">0.240</td>
<td class="has-text-centered performance-medium">0.980</td>
<td class="has-text-centered">0.386</td>
<td class="has-text-centered">0.811</td>
</tr>
<tr>
<td>Llama 3 8B Instruct</td>
<td class="has-text-centered">0.534</td>
<td class="has-text-centered">0.672</td>
<td class="has-text-centered">0.534</td>
<td class="has-text-centered">0.512</td>
<td class="has-text-centered">0.543</td>
<td class="has-text-centered">0.857</td>
<td class="has-text-centered">0.543</td>
<td class="has-text-centered">0.659</td>
<td class="has-text-centered">0.565</td>
<td class="has-text-centered">0.618</td>
<td class="has-text-centered">0.565</td>
<td class="has-text-centered">0.497</td>
<td class="has-text-centered">0.801</td>
<td class="has-text-centered">0.463</td>
<td class="has-text-centered">0.571</td>
<td class="has-text-centered">0.511</td>
<td class="has-text-centered">0.763</td>
</tr>
<tr>
<td>DBRX Instruct</td>
<td class="has-text-centered">0.578</td>
<td class="has-text-centered">0.706</td>
<td class="has-text-centered">0.578</td>
<td class="has-text-centered">0.574</td>
<td class="has-text-centered">0.359</td>
<td class="has-text-centered">0.851</td>
<td class="has-text-centered">0.359</td>
<td class="has-text-centered">0.483</td>
<td class="has-text-centered">0.285</td>
<td class="has-text-centered">0.572</td>
<td class="has-text-centered">0.285</td>
<td class="has-text-centered">0.193</td>
<td class="has-text-centered">0.222</td>
<td class="has-text-centered">0.190</td>
<td class="has-text-centered performance-best">1.000</td>
<td class="has-text-centered">0.319</td>
<td class="has-text-centered">0.746</td>
</tr>
<tr>
<td>DeepSeek LLM (67B)</td>
<td class="has-text-centered">0.596</td>
<td class="has-text-centered">0.711</td>
<td class="has-text-centered">0.596</td>
<td class="has-text-centered">0.578</td>
<td class="has-text-centered">0.369</td>
<td class="has-text-centered">0.856</td>
<td class="has-text-centered">0.369</td>
<td class="has-text-centered">0.492</td>
<td class="has-text-centered">0.532</td>
<td class="has-text-centered">0.678</td>
<td class="has-text-centered">0.532</td>
<td class="has-text-centered">0.407</td>
<td class="has-text-centered">0.832</td>
<td class="has-text-centered performance-best">1.000</td>
<td class="has-text-centered">0.082</td>
<td class="has-text-centered">0.151</td>
<td class="has-text-centered">0.778</td>
</tr>
<tr>
<td>Gemma 2 27B</td>
<td class="has-text-centered">0.639</td>
<td class="has-text-centered">0.730</td>
<td class="has-text-centered">0.639</td>
<td class="has-text-centered">0.621</td>
<td class="has-text-centered">0.410</td>
<td class="has-text-centered">0.849</td>
<td class="has-text-centered">0.410</td>
<td class="has-text-centered">0.538</td>
<td class="has-text-centered">0.651</td>
<td class="has-text-centered">0.704</td>
<td class="has-text-centered">0.651</td>
<td class="has-text-centered">0.620</td>
<td class="has-text-centered">0.471</td>
<td class="has-text-centered">0.257</td>
<td class="has-text-centered performance-best">1.000</td>
<td class="has-text-centered">0.408</td>
<td class="has-text-centered">0.808</td>
</tr>
<tr>
<td>Gemma 2 9B</td>
<td class="has-text-centered">0.630</td>
<td class="has-text-centered">0.710</td>
<td class="has-text-centered">0.630</td>
<td class="has-text-centered">0.609</td>
<td class="has-text-centered">0.412</td>
<td class="has-text-centered">0.848</td>
<td class="has-text-centered">0.412</td>
<td class="has-text-centered">0.541</td>
<td class="has-text-centered">0.595</td>
<td class="has-text-centered">0.694</td>
<td class="has-text-centered">0.595</td>
<td class="has-text-centered">0.519</td>
<td class="has-text-centered">0.371</td>
<td class="has-text-centered">0.224</td>
<td class="has-text-centered performance-strong">0.990</td>
<td class="has-text-centered">0.365</td>
<td class="has-text-centered performance-best">0.856</td>
</tr>
<tr>
<td>Mistral (7B) Instruct v0.3</td>
<td class="has-text-centered">0.547</td>
<td class="has-text-centered">0.677</td>
<td class="has-text-centered">0.547</td>
<td class="has-text-centered">0.528</td>
<td class="has-text-centered">0.375</td>
<td class="has-text-centered">0.839</td>
<td class="has-text-centered">0.375</td>
<td class="has-text-centered">0.503</td>
<td class="has-text-centered">0.587</td>
<td class="has-text-centered">0.598</td>
<td class="has-text-centered">0.587</td>
<td class="has-text-centered">0.542</td>
<td class="has-text-centered">0.521</td>
<td class="has-text-centered">0.266</td>
<td class="has-text-centered">0.918</td>
<td class="has-text-centered">0.412</td>
<td class="has-text-centered">0.779</td>
</tr>
<tr>
<td>Mixtral-8x22B Instruct</td>
<td class="has-text-centered">0.622</td>
<td class="has-text-centered">0.718</td>
<td class="has-text-centered">0.622</td>
<td class="has-text-centered">0.602</td>
<td class="has-text-centered">0.166</td>
<td class="has-text-centered">0.811</td>
<td class="has-text-centered">0.166</td>
<td class="has-text-centered">0.221</td>
<td class="has-text-centered">0.562</td>
<td class="has-text-centered">0.709</td>
<td class="has-text-centered">0.562</td>
<td class="has-text-centered">0.465</td>
<td class="has-text-centered">0.732</td>
<td class="has-text-centered">0.384</td>
<td class="has-text-centered">0.775</td>
<td class="has-text-centered">0.513</td>
<td class="has-text-centered performance-medium">0.835</td>
</tr>
<tr>
<td>Mixtral-8x7B Instruct</td>
<td class="has-text-centered">0.567</td>
<td class="has-text-centered">0.693</td>
<td class="has-text-centered">0.567</td>
<td class="has-text-centered">0.547</td>
<td class="has-text-centered">0.285</td>
<td class="has-text-centered">0.838</td>
<td class="has-text-centered">0.285</td>
<td class="has-text-centered">0.396</td>
<td class="has-text-centered">0.623</td>
<td class="has-text-centered">0.636</td>
<td class="has-text-centered">0.623</td>
<td class="has-text-centered">0.603</td>
<td class="has-text-centered">0.765</td>
<td class="has-text-centered">0.431</td>
<td class="has-text-centered">0.898</td>
<td class="has-text-centered">0.583</td>
<td class="has-text-centered">0.805</td>
</tr>
<tr>
<td>Qwen 2 Instruct (72B)</td>
<td class="has-text-centered">0.644</td>
<td class="has-text-centered">0.730</td>
<td class="has-text-centered">0.644</td>
<td class="has-text-centered">0.627</td>
<td class="has-text-centered">0.370</td>
<td class="has-text-centered">0.848</td>
<td class="has-text-centered">0.370</td>
<td class="has-text-centered">0.495</td>
<td class="has-text-centered">0.623</td>
<td class="has-text-centered">0.639</td>
<td class="has-text-centered">0.623</td>
<td class="has-text-centered">0.605</td>
<td class="has-text-centered">0.821</td>
<td class="has-text-centered">0.506</td>
<td class="has-text-centered">0.867</td>
<td class="has-text-centered">0.639</td>
<td class="has-text-centered">0.830</td>
</tr>
<tr>
<td>WizardLM-2 8x22B</td>
<td class="has-text-centered">0.664</td>
<td class="has-text-centered">0.737</td>
<td class="has-text-centered">0.664</td>
<td class="has-text-centered">0.648</td>
<td class="has-text-centered">0.373</td>
<td class="has-text-centered">0.842</td>
<td class="has-text-centered">0.373</td>
<td class="has-text-centered">0.500</td>
<td class="has-text-centered">0.583</td>
<td class="has-text-centered performance-medium">0.710</td>
<td class="has-text-centered">0.583</td>
<td class="has-text-centered">0.505</td>
<td class="has-text-centered">0.831</td>
<td class="has-text-centered">0.630</td>
<td class="has-text-centered">0.173</td>
<td class="has-text-centered">0.272</td>
<td class="has-text-centered">0.797</td>
</tr>
<tr>
<td>DeepSeek-V3</td>
<td class="has-text-centered performance-strong">0.722</td>
<td class="has-text-centered performance-medium">0.774</td>
<td class="has-text-centered performance-strong">0.722</td>
<td class="has-text-centered performance-strong">0.714</td>
<td class="has-text-centered">0.362</td>
<td class="has-text-centered">0.845</td>
<td class="has-text-centered">0.362</td>
<td class="has-text-centered">0.487</td>
<td class="has-text-centered">0.625</td>
<td class="has-text-centered performance-strong">0.712</td>
<td class="has-text-centered">0.625</td>
<td class="has-text-centered">0.578</td>
<td class="has-text-centered">0.860</td>
<td class="has-text-centered">0.586</td>
<td class="has-text-centered">0.796</td>
<td class="has-text-centered">0.675</td>
<td class="has-text-centered">0.729</td>
</tr>
<tr>
<td>DeepSeek R1</td>
<td class="has-text-centered performance-best">0.772</td>
<td class="has-text-centered performance-strong">0.789</td>
<td class="has-text-centered performance-best">0.772</td>
<td class="has-text-centered performance-best">0.763</td>
<td class="has-text-centered">0.306</td>
<td class="has-text-centered">0.846</td>
<td class="has-text-centered">0.306</td>
<td class="has-text-centered">0.419</td>
<td class="has-text-centered performance-strong">0.679</td>
<td class="has-text-centered">0.682</td>
<td class="has-text-centered performance-strong">0.679</td>
<td class="has-text-centered performance-strong">0.670</td>
<td class="has-text-centered">0.851</td>
<td class="has-text-centered">0.557</td>
<td class="has-text-centered">0.898</td>
<td class="has-text-centered">0.688</td>
<td class="has-text-centered">0.769</td>
</tr>
<tr>
<td>QwQ-32B-Preview</td>
<td class="has-text-centered">0.577</td>
<td class="has-text-centered">0.747</td>
<td class="has-text-centered">0.577</td>
<td class="has-text-centered">0.613</td>
<td class="has-text-centered performance-strong">0.716</td>
<td class="has-text-centered performance-strong">0.871</td>
<td class="has-text-centered performance-strong">0.716</td>
<td class="has-text-centered performance-strong">0.784</td>
<td class="has-text-centered">0.591</td>
<td class="has-text-centered">0.630</td>
<td class="has-text-centered">0.591</td>
<td class="has-text-centered">0.555</td>
<td class="has-text-centered">0.819</td>
<td class="has-text-centered performance-best">1.000</td>
<td class="has-text-centered">0.010</td>
<td class="has-text-centered">0.020</td>
<td class="has-text-centered">0.744</td>
</tr>
<tr>
<td>Jamba 1.5 Mini</td>
<td class="has-text-centered">0.528</td>
<td class="has-text-centered">0.630</td>
<td class="has-text-centered">0.528</td>
<td class="has-text-centered">0.508</td>
<td class="has-text-centered performance-best">0.913</td>
<td class="has-text-centered performance-best">0.883</td>
<td class="has-text-centered performance-best">0.913</td>
<td class="has-text-centered performance-best">0.898</td>
<td class="has-text-centered">0.572</td>
<td class="has-text-centered">0.678</td>
<td class="has-text-centered">0.572</td>
<td class="has-text-centered">0.499</td>
<td class="has-text-centered">0.812</td>
<td class="has-text-centered">0.429</td>
<td class="has-text-centered">0.092</td>
<td class="has-text-centered">0.151</td>
<td class="has-text-centered">0.682</td>
</tr>
<tr>
<td>Jamba 1.5 Large</td>
<td class="has-text-centered">0.642</td>
<td class="has-text-centered">0.746</td>
<td class="has-text-centered">0.642</td>
<td class="has-text-centered">0.628</td>
<td class="has-text-centered">0.494</td>
<td class="has-text-centered">0.851</td>
<td class="has-text-centered">0.494</td>
<td class="has-text-centered">0.618</td>
<td class="has-text-centered">0.597</td>
<td class="has-text-centered">0.650</td>
<td class="has-text-centered">0.597</td>
<td class="has-text-centered">0.550</td>
<td class="has-text-centered">0.855</td>
<td class="has-text-centered">0.639</td>
<td class="has-text-centered">0.469</td>
<td class="has-text-centered">0.541</td>
<td class="has-text-centered">0.782</td>
</tr>
<tr>
<td>Claude 3.5 Sonnet</td>
<td class="has-text-centered">0.682</td>
<td class="has-text-centered">0.755</td>
<td class="has-text-centered">0.682</td>
<td class="has-text-centered">0.668</td>
<td class="has-text-centered">0.513</td>
<td class="has-text-centered">0.854</td>
<td class="has-text-centered">0.513</td>
<td class="has-text-centered">0.634</td>
<td class="has-text-centered performance-medium">0.675</td>
<td class="has-text-centered">0.677</td>
<td class="has-text-centered performance-medium">0.675</td>
<td class="has-text-centered performance-best">0.674</td>
<td class="has-text-centered performance-medium">0.879</td>
<td class="has-text-centered">0.646</td>
<td class="has-text-centered">0.745</td>
<td class="has-text-centered performance-medium">0.692</td>
<td class="has-text-centered">0.827</td>
</tr>
<tr>
<td>Claude 3 Haiku</td>
<td class="has-text-centered">0.639</td>
<td class="has-text-centered">0.735</td>
<td class="has-text-centered">0.639</td>
<td class="has-text-centered">0.622</td>
<td class="has-text-centered">0.067</td>
<td class="has-text-centered">0.674</td>
<td class="has-text-centered">0.067</td>
<td class="has-text-centered">0.022</td>
<td class="has-text-centered">0.633</td>
<td class="has-text-centered">0.634</td>
<td class="has-text-centered">0.633</td>
<td class="has-text-centered">0.631</td>
<td class="has-text-centered">0.838</td>
<td class="has-text-centered">0.556</td>
<td class="has-text-centered">0.561</td>
<td class="has-text-centered">0.558</td>
<td class="has-text-centered">0.781</td>
</tr>
<tr>
<td>Cohere Command R 7B</td>
<td class="has-text-centered">0.530</td>
<td class="has-text-centered">0.650</td>
<td class="has-text-centered">0.530</td>
<td class="has-text-centered">0.516</td>
<td class="has-text-centered performance-medium">0.682</td>
<td class="has-text-centered performance-medium">0.868</td>
<td class="has-text-centered performance-medium">0.682</td>
<td class="has-text-centered performance-medium">0.762</td>
<td class="has-text-centered">0.536</td>
<td class="has-text-centered">0.505</td>
<td class="has-text-centered">0.536</td>
<td class="has-text-centered">0.459</td>
<td class="has-text-centered">0.797</td>
<td class="has-text-centered">0.210</td>
<td class="has-text-centered">0.041</td>
<td class="has-text-centered">0.068</td>
<td class="has-text-centered">0.770</td>
</tr>
<tr>
<td>Cohere Command R+</td>
<td class="has-text-centered">0.660</td>
<td class="has-text-centered">0.747</td>
<td class="has-text-centered">0.660</td>
<td class="has-text-centered">0.651</td>
<td class="has-text-centered">0.575</td>
<td class="has-text-centered">0.859</td>
<td class="has-text-centered">0.575</td>
<td class="has-text-centered">0.684</td>
<td class="has-text-centered">0.526</td>
<td class="has-text-centered">0.655</td>
<td class="has-text-centered">0.526</td>
<td class="has-text-centered">0.393</td>
<td class="has-text-centered">0.804</td>
<td class="has-text-centered">0.333</td>
<td class="has-text-centered">0.071</td>
<td class="has-text-centered">0.118</td>
<td class="has-text-centered">0.812</td>
</tr>
<tr>
<td>Google Gemini 1.5 Pro</td>
<td class="has-text-centered">0.483</td>
<td class="has-text-centered">0.487</td>
<td class="has-text-centered">0.483</td>
<td class="has-text-centered">0.418</td>
<td class="has-text-centered">0.240</td>
<td class="has-text-centered">0.823</td>
<td class="has-text-centered">0.240</td>
<td class="has-text-centered">0.336</td>
<td class="has-text-centered">0.619</td>
<td class="has-text-centered">0.667</td>
<td class="has-text-centered">0.619</td>
<td class="has-text-centered">0.579</td>
<td class="has-text-centered">0.700</td>
<td class="has-text-centered">0.369</td>
<td class="has-text-centered">0.908</td>
<td class="has-text-centered">0.525</td>
<td class="has-text-centered performance-strong">0.837</td>
</tr>
<tr>
<td>OpenAI gpt-4o</td>
<td class="has-text-centered performance-medium">0.704</td>
<td class="has-text-centered performance-best">0.792</td>
<td class="has-text-centered performance-medium">0.704</td>
<td class="has-text-centered performance-medium">0.710</td>
<td class="has-text-centered">0.396</td>
<td class="has-text-centered">0.846</td>
<td class="has-text-centered">0.396</td>
<td class="has-text-centered">0.524</td>
<td class="has-text-centered performance-best">0.681</td>
<td class="has-text-centered performance-best">0.719</td>
<td class="has-text-centered performance-best">0.681</td>
<td class="has-text-centered performance-medium">0.664</td>
<td class="has-text-centered performance-best">0.896</td>
<td class="has-text-centered performance-medium">0.667</td>
<td class="has-text-centered">0.857</td>
<td class="has-text-centered performance-best">0.750</td>
<td class="has-text-centered">0.824</td>
</tr>
<tr>
<td>OpenAI o1-mini</td>
<td class="has-text-centered">0.681</td>
<td class="has-text-centered">0.760</td>
<td class="has-text-centered">0.681</td>
<td class="has-text-centered">0.670</td>
<td class="has-text-centered">0.487</td>
<td class="has-text-centered">0.851</td>
<td class="has-text-centered">0.487</td>
<td class="has-text-centered">0.612</td>
<td class="has-text-centered">0.651</td>
<td class="has-text-centered">0.670</td>
<td class="has-text-centered">0.651</td>
<td class="has-text-centered">0.635</td>
<td class="has-text-centered performance-strong">0.888</td>
<td class="has-text-centered performance-medium">0.664</td>
<td class="has-text-centered">0.786</td>
<td class="has-text-centered performance-strong">0.720</td>
<td class="has-text-centered">0.769</td>
</tr>
</tbody>
</table>
<div class="content is-small mt-4">
<p><strong>Note:</strong> Color highlighting indicates performance ranking:
<span class="performance-best">&nbsp;Best&nbsp;</span>,
<span class="performance-medium">&nbsp;Strong&nbsp;</span>,
<span class="performance-low">&nbsp;Good&nbsp;</span>
</p>
</div>
</div>
</div><!-- Text Summarization -->
<div id="text-summarization" class="tab-content">
<h2 class="title is-4">Text Summarization Task Results</h2>
<div class="results-table">
<table class="table is-bordered is-striped is-narrow is-hoverable is-fullwidth">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="3" class="has-text-centered tooltip-trigger" data-tooltip="Designed for bullet-point summarization of long earnings call transcripts (ECTs) in the financial domain. 2,425 document-summary pairs from publicly traded companies' earnings calls (2019-2022), with concise bullet points extracted from Reuters articles focusing on key financial metrics.">ECTSum</th>
<th colspan="3" class="has-text-centered tooltip-trigger" data-tooltip="Financial news summarization dataset with 2,000 financial news articles, each paired with its headline as the ground-truth summary. Manually selected and cleaned to ensure high-quality annotations, providing a benchmark for evaluating LLMs on financial text summarization.">EDTSum</th>
</tr>
<tr>
<th class="has-text-centered">BERTScore Precision</th>
<th class="has-text-centered">BERTScore Recall</th>
<th class="has-text-centered">BERTScore F1</th>
<th class="has-text-centered">BERTScore Precision</th>
<th class="has-text-centered">BERTScore Recall</th>
<th class="has-text-centered">BERTScore F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 3 70B Instruct</td>
<td class="has-text-centered">0.715</td>
<td class="has-text-centered">0.801</td>
<td class="has-text-centered">0.754</td>
<td class="has-text-centered">0.793</td>
<td class="has-text-centered performance-medium">0.844</td>
<td class="has-text-centered performance-strong">0.817</td>
</tr>
<tr>
<td>Llama 3 8B Instruct</td>
<td class="has-text-centered">0.724</td>
<td class="has-text-centered">0.796</td>
<td class="has-text-centered">0.757</td>
<td class="has-text-centered">0.785</td>
<td class="has-text-centered">0.841</td>
<td class="has-text-centered">0.811</td>
</tr>
<tr>
<td>DBRX Instruct</td>
<td class="has-text-centered">0.680</td>
<td class="has-text-centered">0.786</td>
<td class="has-text-centered">0.729</td>
<td class="has-text-centered">0.774</td>
<td class="has-text-centered">0.843</td>
<td class="has-text-centered">0.806</td>
</tr>
<tr>
<td>DeepSeek LLM (67B)</td>
<td class="has-text-centered">0.692</td>
<td class="has-text-centered">0.678</td>
<td class="has-text-centered">0.681</td>
<td class="has-text-centered">0.779</td>
<td class="has-text-centered">0.840</td>
<td class="has-text-centered">0.807</td>
</tr>
<tr>
<td>Gemma 2 27B</td>
<td class="has-text-centered">0.680</td>
<td class="has-text-centered">0.777</td>
<td class="has-text-centered">0.723</td>
<td class="has-text-centered performance-strong">0.801</td>
<td class="has-text-centered">0.829</td>
<td class="has-text-centered">0.814</td>
</tr>
<tr>
<td>Gemma 2 9B</td>
<td class="has-text-centered">0.651</td>
<td class="has-text-centered">0.531</td>
<td class="has-text-centered">0.585</td>
<td class="has-text-centered performance-best">0.803</td>
<td class="has-text-centered">0.833</td>
<td class="has-text-centered performance-strong">0.817</td>
</tr>
<tr>
<td>Mistral (7B) Instruct v0.3</td>
<td class="has-text-centered">0.702</td>
<td class="has-text-centered performance-strong">0.806</td>
<td class="has-text-centered">0.750</td>
<td class="has-text-centered">0.783</td>
<td class="has-text-centered">0.842</td>
<td class="has-text-centered">0.811</td>
</tr>
<tr>
<td>Mixtral-8x22B Instruct</td>
<td class="has-text-centered">0.713</td>
<td class="has-text-centered performance-best">0.812</td>
<td class="has-text-centered">0.758</td>
<td class="has-text-centered">0.790</td>
<td class="has-text-centered">0.843</td>
<td class="has-text-centered">0.815</td>
</tr>
<tr>
<td>Mixtral-8x7B Instruct</td>
<td class="has-text-centered">0.727</td>
<td class="has-text-centered">0.773</td>
<td class="has-text-centered">0.747</td>
<td class="has-text-centered">0.785</td>
<td class="has-text-centered">0.839</td>
<td class="has-text-centered">0.810</td>
</tr>
<tr>
<td>Qwen 2 Instruct (72B)</td>
<td class="has-text-centered">0.709</td>
<td class="has-text-centered performance-medium">0.804</td>
<td class="has-text-centered">0.752</td>
<td class="has-text-centered">0.781</td>
<td class="has-text-centered performance-strong">0.846</td>
<td class="has-text-centered">0.811</td>
</tr>
<tr>
<td>WizardLM-2 8x22B</td>
<td class="has-text-centered">0.677</td>
<td class="has-text-centered performance-strong">0.806</td>
<td class="has-text-centered">0.735</td>
<td class="has-text-centered">0.774</td>
<td class="has-text-centered performance-best">0.847</td>
<td class="has-text-centered">0.808</td>
</tr>
<tr>
<td>DeepSeek-V3</td>
<td class="has-text-centered">0.703</td>
<td class="has-text-centered performance-strong">0.806</td>
<td class="has-text-centered">0.750</td>
<td class="has-text-centered">0.791</td>
<td class="has-text-centered">0.842</td>
<td class="has-text-centered">0.815</td>
</tr>
<tr>
<td>DeepSeek R1</td>
<td class="has-text-centered">0.724</td>
<td class="has-text-centered">0.800</td>
<td class="has-text-centered">0.759</td>
<td class="has-text-centered">0.770</td>
<td class="has-text-centered">0.843</td>
<td class="has-text-centered">0.804</td>
</tr>
<tr>
<td>QwQ-32B-Preview</td>
<td class="has-text-centered">0.653</td>
<td class="has-text-centered">0.751</td>
<td class="has-text-centered">0.696</td>
<td class="has-text-centered">0.797</td>
<td class="has-text-centered">0.841</td>
<td class="has-text-centered performance-strong">0.817</td>
</tr>
<tr>
<td>Jamba 1.5 Mini</td>
<td class="has-text-centered">0.692</td>
<td class="has-text-centered">0.798</td>
<td class="has-text-centered">0.741</td>
<td class="has-text-centered">0.798</td>
<td class="has-text-centered">0.838</td>
<td class="has-text-centered performance-medium">0.816</td>
</tr>
<tr>
<td>Jamba 1.5 Large</td>
<td class="has-text-centered">0.679</td>
<td class="has-text-centered">0.800</td>
<td class="has-text-centered">0.734</td>
<td class="has-text-centered">0.799</td>
<td class="has-text-centered">0.841</td>
<td class="has-text-centered performance-best">0.818</td>
</tr>
<tr>
<td>Claude 3.5 Sonnet</td>
<td class="has-text-centered performance-medium">0.737</td>
<td class="has-text-centered">0.802</td>
<td class="has-text-centered performance-medium">0.767</td>
<td class="has-text-centered">0.786</td>
<td class="has-text-centered">0.843</td>
<td class="has-text-centered">0.813</td>
</tr>
<tr>
<td>Claude 3 Haiku</td>
<td class="has-text-centered">0.683</td>
<td class="has-text-centered">0.617</td>
<td class="has-text-centered">0.646</td>
<td class="has-text-centered">0.778</td>
<td class="has-text-centered performance-medium">0.844</td>
<td class="has-text-centered">0.808</td>
</tr>
<tr>
<td>Cohere Command R 7B</td>
<td class="has-text-centered">0.724</td>
<td class="has-text-centered">0.781</td>
<td class="has-text-centered">0.750</td>
<td class="has-text-centered">0.790</td>
<td class="has-text-centered performance-medium">0.844</td>
<td class="has-text-centered">0.815</td>
</tr>
<tr>
<td>Cohere Command R+</td>
<td class="has-text-centered">0.724</td>
<td class="has-text-centered">0.782</td>
<td class="has-text-centered">0.751</td>
<td class="has-text-centered">0.789</td>
<td class="has-text-centered">0.834</td>
<td class="has-text-centered">0.810</td>
</tr>
<tr>
<td>Google Gemini 1.5 Pro</td>
<td class="has-text-centered performance-best">0.757</td>
<td class="has-text-centered">0.800</td>
<td class="has-text-centered performance-best">0.777</td>
<td class="has-text-centered performance-medium">0.800</td>
<td class="has-text-centered">0.836</td>
<td class="has-text-centered performance-strong">0.817</td>
</tr>
<tr>
<td>OpenAI gpt-4o</td>
<td class="has-text-centered performance-strong">0.755</td>
<td class="has-text-centered">0.793</td>
<td class="has-text-centered performance-strong">0.773</td>
<td class="has-text-centered">0.795</td>
<td class="has-text-centered">0.840</td>
<td class="has-text-centered performance-medium">0.816</td>
</tr>
<tr>
<td>OpenAI o1-mini</td>
<td class="has-text-centered">0.731</td>
<td class="has-text-centered">0.801</td>
<td class="has-text-centered">0.763</td>
<td class="has-text-centered">0.795</td>
<td class="has-text-centered">0.840</td>
<td class="has-text-centered performance-medium">0.816</td>
</tr>
</tbody>
</table>
<div class="content is-small mt-4">
<p><strong>Note:</strong> Color highlighting indicates performance ranking:
<span class="performance-best">&nbsp;Best&nbsp;</span>,
<span class="performance-medium">&nbsp;Strong&nbsp;</span>,
<span class="performance-low">&nbsp;Good&nbsp;</span>
</p>
</div>
</div>
</div>
</div>
</div>
</section>
<footer class="footer">
<div class="content has-text-centered">
<p>
<strong>FLaME</strong> by <a href="https://github.com/flame-benchmark/flame">The FLaME Team</a>.
The source code is available on <a href="https://github.com/flame-benchmark/flame">GitHub</a>.
</p>
<p>
<a href="https://huggingface.co/spaces/flame-benchmark/flame">HuggingFace Space</a> |
<a href="https://arxiv.org/abs/2402.14017">arXiv Paper</a>
</p>
</div>
</footer>
<script src="static/js/results.js"></script>
</body>
</html>