
FLaME: Financial Language Model Evaluation Results

This page presents the results of the FLaME evaluation across various financial NLP tasks. Each section below shows performance metrics for a different task category.

Overall Performance Across All Tasks

Task categories and datasets: Information Retrieval (FiNER, FR = FinRed, RD = ReFiND, FNXL, FE = FinEntity); Sentiment Analysis (FiQA, SQA = SubjECTive-QA, FPB = Financial Phrase Bank); Causal Analysis (CD = Causal Detection, CC = Causal Classification); Text Classification (B77 = Banking77, FB = FinBench, FOMC, NC = NumClaim, HL = Headlines); Question Answering (CFQA = ConvFinQA, FinQA, TQA = TATQA); Summarization (ECTSum, EDTSum).

Metric used: F1 score for every dataset except FiQA (MSE), Headlines and the question answering datasets (accuracy), and the summarization datasets (BERTScore F1).

Model  FiNER FR RD FNXL FE  FiQA SQA FPB  CD CC  B77 FB FOMC NC HL  CFQA FinQA TQA  ECTSum EDTSum
Llama 3 70B Instruct .701 .332 .883 .020 .469  .123 .535 .902  .142 .192  .645 .309 .652 .386 .811  .709 .809 .772  .754 .817
Llama 3 8B Instruct .565 .289 .705 .003 .350  .161 .600 .698  .049 .234  .512 .659 .497 .511 .763  .268 .767 .706  .757 .811
DBRX Instruct .489 .304 .778 .009 .006  .160 .436 .499  .087 .231  .574 .483 .193 .319 .746  .252 .738 .633  .729 .806
DeepSeek LLM (67B) .745 .334 .879 .007 .416  .118 .462 .811  .025 .193  .578 .492 .407 .151 .778  .174 .742 .355  .681 .807
Gemma 2 27B .761 .356 .902 .006 .298  .100 .515 .884  .133 .242  .621 .538 .620 .408 .808  .268 .768 .734  .723 .814
Gemma 2 9B .651 .331 .892 .005 .367  .189 .491 .940  .105 .207  .609 .541 .519 .365 .856  .292 .779 .750  .585 .817
Mistral (7B) Instruct v0.3 .526 .276 .771 .004 .368  .135 .522 .841  .052 .227  .528 .503 .542 .412 .779  .199 .655 .553  .750 .811
Mixtral-8x22B Instruct .635 .367 .811 .009 .435  .221 .510 .776  .125 .308  .602 .221 .465 .513 .835  .285 .766 .666  .758 .815
Mixtral-8x7B Instruct .598 .282 .845 .009 .267  .208 .498 .893  .055 .229  .547 .396 .603 .583 .805  .315 .611 .501  .747 .810
Qwen 2 Instruct (72B) .748 .348 .854 .012 .483  .205 .576 .901  .190 .184  .627 .495 .605 .639 .830  .269 .819 .715  .752 .811
WizardLM-2 8x22B .744 .355 .852 .008 .226  .129 .566 .779  .114 .201  .648 .500 .505 .272 .797  .247 .796 .725  .735 .808
DeepSeek-V3 .790 .437 .934 .045 .549  .150 .583 .814  .198 .170  .714 .487 .578 .675 .729  .261 .840 .779  .750 .815
DeepSeek R1 .807 .393 .952 .057 .587  .110 .499 .902  .337 .202  .763 .419 .670 .688 .769  .853 .836 .858  .759 .804
QwQ-32B-Preview .685 .270 .656 .001 .005  .141 .550 .815  .131 .220  .613 .784 .555 .020 .744  .282 .793 .796  .696 .817
Jamba 1.5 Mini .552 .284 .844 .005 .132  .119 .418 .765  .043 .270  .508 .898 .499 .151 .682  .218 .666 .586  .741 .816
Jamba 1.5 Large .693 .341 .862 .005 .397  .183 .582 .798  .074 .176  .628 .618 .550 .541 .782  .225 .790 .660  .734 .818
Claude 3.5 Sonnet .799 .439 .891 .047 .655  .101 .553 .944  .196 .197  .668 .634 .674 .692 .827  .402 .844 .700  .767 .813
Claude 3 Haiku .711 .285 .883 .015 .494  .167 .463 .908  .081 .200  .622 .022 .631 .558 .781  .421 .803 .733  .646 .808
Cohere Command R 7B .748 .194 .845 .018 .441  .164 .532 .840  .057 .255  .516 .762 .459 .068 .770  .212 .709 .716  .750 .815
Cohere Command R + .756 .333 .922 .021 .452  .106 .533 .699  .080 .238  .651 .684 .393 .118 .812  .259 .776 .698  .751 .810
Google Gemini 1.5 Pro .712 .374 .944 .019 .393  .144 .593 .885  .196 .217  .418 .336 .579 .525 .837  .280 .829 .763  .777 .817
OpenAI gpt-4o .766 .399 .942 .037 .523  .184 .541 .928  .130 .222  .710 .524 .664 .750 .824  .749 .836 .754  .773 .816
OpenAI o1-mini .761 .403 .876 .010 .662  .120 .542 .917  .289 .209  .670 .612 .635 .720 .769  .840 .799 .698  .763 .816

Causal Analysis Results

Model  Causal Detection (Accuracy, Precision, Recall, F1)  Causal Classification (Precision, Recall, F1, Accuracy)
Llama 3 70B Instruct 0.148 0.429 0.148 0.142 0.241 0.329 0.192 0.198
Llama 3 8B Instruct 0.097 0.341 0.097 0.049 0.232 0.241 0.234 0.380
DBRX Instruct 0.078 0.521 0.078 0.087 0.276 0.313 0.231 0.235
DeepSeek LLM (67B) 0.026 0.214 0.026 0.025 0.141 0.328 0.193 0.221
Gemma 2 27B 0.115 0.510 0.115 0.133 0.309 0.310 0.242 0.262
Gemma 2 9B 0.115 0.394 0.115 0.105 0.275 0.294 0.207 0.258
Mistral (7B) Instruct v0.3 0.078 0.455 0.078 0.052 0.339 0.361 0.227 0.258
Mixtral-8x22B Instruct 0.131 0.486 0.131 0.125 0.344 0.310 0.308 0.318
Mixtral-8x7B Instruct 0.088 0.510 0.088 0.055 0.308 0.314 0.229 0.273
Qwen 2 Instruct (72B) 0.139 0.489 0.139 0.190 0.208 0.330 0.184 0.188
WizardLM-2 8x22B 0.076 0.453 0.076 0.114 0.263 0.347 0.201 0.237
DeepSeek-V3 0.164 0.528 0.164 0.198 0.194 0.327 0.170 0.248
DeepSeek R1 0.245 0.643 0.245 0.337 0.385 0.318 0.202 0.221
QwQ-32B-Preview 0.110 0.473 0.110 0.131 0.193 0.262 0.220 0.465
Jamba 1.5 Mini 0.050 0.280 0.050 0.043 0.323 0.283 0.270 0.295
Jamba 1.5 Large 0.076 0.517 0.076 0.074 0.268 0.248 0.176 0.200
Claude 3.5 Sonnet 0.154 0.564 0.154 0.196 0.259 0.336 0.197 0.235
Claude 3 Haiku 0.082 0.388 0.082 0.081 0.369 0.347 0.200 0.203
Cohere Command R 7B 0.089 0.363 0.089 0.057 0.379 0.356 0.255 0.275
Cohere Command R + 0.090 0.453 0.090 0.080 0.353 0.336 0.238 0.265
Google Gemini 1.5 Pro 0.165 0.514 0.165 0.196 0.265 0.357 0.217 0.258
OpenAI gpt-4o 0.082 0.576 0.082 0.130 0.254 0.327 0.222 0.235
OpenAI o1-mini 0.206 0.648 0.206 0.289 0.325 0.316 0.209 0.233
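For reference, the detection and classification scores in this table are standard multi-class classification metrics. The snippet below is a minimal, illustrative sketch of how such accuracy, precision, recall, and F1 values are commonly computed with scikit-learn; the label set and the weighted averaging are assumptions for illustration and may not match FLaME's own evaluation code.

```python
# Illustrative sketch only: hypothetical labels and weighted averaging;
# FLaME's actual evaluation scripts may differ.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = ["cause", "effect", "none", "cause", "none"]   # hypothetical gold labels
pred = ["cause", "none", "none", "effect", "none"]    # hypothetical model outputs

acc = accuracy_score(gold, pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    gold, pred, average="weighted", zero_division=0
)
print(f"Accuracy={acc:.3f}  Precision={prec:.3f}  Recall={rec:.3f}  F1={f1:.3f}")
```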


Information Retrieval Task Results

Model  FiNER (Precision, Recall, F1, Accuracy)  FinRed (Accuracy, Precision, Recall, F1)  ReFiND (Accuracy, Precision, Recall, F1)  FNXL (Precision, Recall, F1, Accuracy)  FinEntity (Precision, Recall, Accuracy, F1)
Llama 3 70B Instruct 0.715 0.693 0.701 0.911  0.314 0.454 0.314 0.332  0.879 0.904 0.879 0.883  0.015 0.030 0.020 0.010  0.474 0.485 0.485 0.469
Llama 3 8B Instruct 0.581 0.558 0.565 0.854  0.296 0.357 0.296 0.289  0.723 0.755 0.723 0.705  0.003 0.004 0.003 0.002  0.301 0.478 0.478 0.350
DBRX Instruct 0.516 0.476 0.489 0.802  0.329 0.371 0.329 0.304  0.766 0.825 0.766 0.778  0.008 0.011 0.009 0.005  0.004 0.014 0.014 0.006
DeepSeek LLM (67B) 0.752 0.742 0.745 0.917  0.344 0.403 0.344 0.334  0.874 0.890 0.874 0.879  0.005 0.009 0.007 0.003  0.456 0.405 0.405 0.416
Gemma 2 27B 0.772 0.754 0.761 0.923  0.352 0.437 0.352 0.356  0.897 0.914 0.897 0.902  0.005 0.008 0.006 0.003  0.320 0.295 0.295 0.298
Gemma 2 9B 0.665 0.643 0.651 0.886  0.336 0.373 0.336 0.331  0.885 0.902 0.885 0.892  0.004 0.008 0.005 0.003  0.348 0.419 0.419 0.367
Mistral (7B) Instruct v0.3 0.540 0.522 0.526 0.806  0.278 0.383 0.278 0.276  0.767 0.817 0.767 0.771  0.004 0.006 0.004 0.002  0.337 0.477 0.477 0.368
Mixtral-8x22B Instruct 0.653 0.625 0.635 0.870  0.381 0.414 0.381 0.367  0.807 0.847 0.807 0.811  0.010 0.008 0.009 0.005  0.428 0.481 0.481 0.435
Mixtral-8x7B Instruct 0.613 0.591 0.598 0.875  0.291 0.376 0.291 0.282  0.840 0.863 0.840 0.845  0.007 0.012 0.009 0.005  0.251 0.324 0.324 0.267
Qwen 2 Instruct (72B) 0.766 0.742 0.748 0.899  0.365 0.407 0.365 0.348  0.850 0.881 0.850 0.854  0.010 0.016 0.012 0.006  0.468 0.530 0.530 0.483
WizardLM-2 8x22B 0.755 0.741 0.744 0.920  0.362 0.397 0.362 0.355  0.846 0.874 0.846 0.852  0.008 0.009 0.008 0.004  0.222 0.247 0.247 0.226
DeepSeek-V3 0.798 0.787 0.790 0.945  0.450 0.463 0.450 0.437  0.927 0.943 0.927 0.934  0.034 0.067 0.045 0.023  0.563 0.544 0.544 0.549
DeepSeek R1 0.813 0.805 0.807 0.944  0.412 0.424 0.412 0.393  0.946 0.960 0.946 0.952  0.044 0.082 0.057 0.029  0.600 0.586 0.586 0.587
QwQ-32B-Preview 0.695 0.681 0.685 0.907  0.278 0.396 0.278 0.270  0.680 0.770 0.680 0.656  0.001 0.001 0.001 0.000  0.005 0.005 0.005 0.005
Jamba 1.5 Mini 0.564 0.556 0.552 0.818  0.308 0.450 0.308 0.284  0.830 0.864 0.830 0.844  0.004 0.006 0.005 0.003  0.119 0.182 0.182 0.132
Jamba 1.5 Large 0.707 0.687 0.693 0.883  0.341 0.452 0.341 0.341  0.856 0.890 0.856 0.862  0.004 0.005 0.005 0.002  0.403 0.414 0.414 0.397
Claude 3.5 Sonnet 0.811 0.794 0.799 0.922  0.455 0.465 0.455 0.439  0.873 0.927 0.873 0.891  0.034 0.080 0.047 0.024  0.658 0.668 0.668 0.655
Claude 3 Haiku 0.732 0.700 0.711 0.895  0.294 0.330 0.294 0.285  0.879 0.917 0.879 0.883  0.011 0.022 0.015 0.008  0.498 0.517 0.517 0.494
Cohere Command R + 0.769 0.750 0.756 0.902  0.353 0.405 0.353 0.333  0.917 0.930 0.917 0.922  0.016 0.032 0.021 0.011  0.462 0.459 0.459 0.452
Google Gemini 1.5 Pro 0.728 0.705 0.712 0.891  0.373 0.436 0.373 0.374  0.934 0.955 0.934 0.944  0.014 0.028 0.019 0.010  0.399 0.400 0.400 0.393
OpenAI gpt-4o 0.778 0.760 0.766 0.911  0.402 0.445 0.402 0.399  0.931 0.955 0.931 0.942  0.027 0.056 0.037 0.019  0.537 0.517 0.517 0.523
OpenAI o1-mini 0.772 0.755 0.761 0.922  0.407 0.444 0.407 0.403  0.867 0.900 0.867 0.876  0.007 0.015 0.010 0.005  0.661 0.681 0.681 0.662
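FiNER and FinEntity are entity-tagging datasets, so their precision, recall, and F1 are typically computed over predicted tag sequences. As an illustration only, here is a minimal sketch assuming the seqeval library and hypothetical BIO tags; FLaME's own scoring pipeline may parse and score model output differently.

```python
# Illustrative sketch only: hypothetical BIO-tagged sentences scored with seqeval.
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

gold = [["B-ORG", "I-ORG", "O", "B-LOC"], ["O", "B-PER", "O"]]  # reference tags
pred = [["B-ORG", "I-ORG", "O", "O"],     ["O", "B-PER", "O"]]  # model tags

print("Precision:", precision_score(gold, pred))  # entity-level precision
print("Recall:   ", recall_score(gold, pred))     # entity-level recall
print("F1:       ", f1_score(gold, pred))         # entity-level F1
print("Accuracy: ", accuracy_score(gold, pred))   # token-level accuracy
```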


Question Answering Task Results

Model Datasets (Accuracy)
FinQA ConvFinQA TATQA
Llama 3 70B Instruct 0.809 0.709 0.772
Llama 3 8B Instruct 0.767 0.268 0.706
DBRX Instruct 0.738 0.252 0.633
DeepSeek LLM (67B) 0.742 0.174 0.355
Gemma 2 27B 0.768 0.268 0.734
Gemma 2 9B 0.779 0.292 0.750
Mistral (7B) Instruct v0.3 0.655 0.199 0.553
Mixtral-8x22B Instruct 0.766 0.285 0.666
Mixtral-8x7B Instruct 0.611 0.315 0.501
Qwen 2 Instruct (72B) 0.819 0.269 0.715
WizardLM-2 8x22B 0.796 0.247 0.725
DeepSeek-V3 0.840 0.261 0.779
DeepSeek R1 0.836 0.853 0.858
QwQ-32B-Preview 0.793 0.282 0.796
Jamba 1.5 Mini 0.666 0.218 0.586
Jamba 1.5 Large 0.790 0.225 0.660
Claude 3.5 Sonnet 0.844 0.402 0.700
Claude 3 Haiku 0.803 0.421 0.733
Cohere Command R 7B 0.709 0.212 0.716
Cohere Command R + 0.776 0.259 0.698
Google Gemini 1.5 Pro 0.829 0.280 0.763
OpenAI gpt-4o 0.836 0.749 0.754
OpenAI o1-mini 0.799 0.840 0.698
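Answers in FinQA, ConvFinQA, and TATQA are largely numeric, so accuracy is usually an exact-match check with some numeric normalization. The sketch below is a minimal version under that assumption; the answer-extraction and tolerance rules shown here are hypothetical, not FLaME's exact ones.

```python
# Illustrative sketch only: simple exact/numeric matching for QA accuracy.
def answers_match(pred: str, gold: str, rel_tol: float = 1e-2) -> bool:
    """Count answers as equal if they match as text or as nearby numbers."""
    pred = pred.strip().lower().rstrip("%")
    gold = gold.strip().lower().rstrip("%")
    if pred == gold:
        return True
    try:
        p, g = float(pred.replace(",", "")), float(gold.replace(",", ""))
    except ValueError:
        return False
    return abs(p - g) <= rel_tol * max(1.0, abs(g))

preds = ["5.2%", "1,200", "increase"]  # hypothetical model answers
golds = ["5.2", "1200", "decrease"]    # hypothetical references

accuracy = sum(answers_match(p, g) for p, g in zip(preds, golds)) / len(golds)
print(f"Accuracy = {accuracy:.3f}")    # 2 of 3 match -> 0.667
```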


Sentiment Analysis Task Results

Model  FiQA Task 1 (MSE, MAE, r² Score)  Financial Phrase Bank / FPB (Accuracy, Precision, Recall, F1)  SubjECTive-QA (Precision, Recall, F1, Accuracy)
Llama 3 70B Instruct 0.123 0.290 0.272 0.901 0.904 0.901 0.902 0.652 0.573 0.535 0.573
Llama 3 8B Instruct 0.161 0.344 0.045 0.738 0.801 0.738 0.698 0.635 0.625 0.600 0.625
DBRX Instruct 0.160 0.321 0.052 0.524 0.727 0.524 0.499 0.654 0.541 0.436 0.541
DeepSeek LLM (67B) 0.118 0.278 0.302 0.815 0.867 0.815 0.811 0.676 0.544 0.462 0.544
Gemma 2 27B 0.100 0.266 0.406 0.890 0.896 0.890 0.884 0.562 0.524 0.515 0.524
Gemma 2 9B 0.189 0.352 -0.120 0.940 0.941 0.940 0.940 0.570 0.499 0.491 0.499
Mistral (7B) Instruct v0.3 0.135 0.278 0.200 0.847 0.854 0.847 0.841 0.607 0.542 0.522 0.542
Mixtral-8x22B Instruct 0.221 0.364 -0.310 0.768 0.845 0.768 0.776 0.614 0.538 0.510 0.538
Mixtral-8x7B Instruct 0.208 0.307 -0.229 0.896 0.898 0.896 0.893 0.611 0.518 0.498 0.518
Qwen 2 Instruct (72B) 0.205 0.409 -0.212 0.904 0.908 0.904 0.901 0.644 0.601 0.576 0.601
WizardLM-2 8x22B 0.129 0.283 0.239 0.765 0.853 0.765 0.779 0.611 0.570 0.566 0.570
DeepSeek-V3 0.150 0.311 0.111 0.828 0.851 0.828 0.814 0.640 0.572 0.583 0.572
DeepSeek R1 0.110 0.289 0.348 0.904 0.907 0.904 0.902 0.644 0.489 0.499 0.489
QwQ-32B-Preview 0.141 0.290 0.165 0.812 0.827 0.812 0.815 0.629 0.534 0.550 0.534
Jamba 1.5 Mini 0.119 0.282 0.293 0.784 0.814 0.784 0.765 0.380 0.525 0.418 0.525
Jamba 1.5 Large 0.183 0.363 -0.085 0.824 0.850 0.824 0.798 0.635 0.573 0.582 0.573
Claude 3.5 Sonnet 0.101 0.268 0.402 0.944 0.945 0.944 0.944 0.634 0.585 0.553 0.585
Claude 3 Haiku 0.167 0.349 0.008 0.907 0.913 0.907 0.908 0.619 0.538 0.463 0.538
Cohere Command R 7B 0.164 0.319 0.028 0.835 0.861 0.835 0.840 0.609 0.547 0.532 0.547
Cohere Command R + 0.106 0.274 0.373 0.741 0.806 0.741 0.699 0.608 0.547 0.533 0.547
Google Gemini 1.5 Pro 0.144 0.329 0.149 0.890 0.895 0.890 0.885 0.642 0.587 0.593 0.587
OpenAI gpt-4o 0.184 0.317 -0.089 0.929 0.931 0.929 0.928 0.639 0.515 0.541 0.515
OpenAI o1-mini 0.120 0.295 0.289 0.918 0.917 0.918 0.917 0.660 0.515 0.542 0.515
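FiQA Task 1 is scored as a regression over continuous sentiment values, which is why MSE, MAE, and r² appear alongside the classification metrics for FPB and SubjECTive-QA. Below is a minimal sketch of the regression metrics, assuming scikit-learn and hypothetical scores; FLaME's parsing of model output may differ.

```python
# Illustrative sketch only: hypothetical continuous sentiment scores.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

gold = [0.37, -0.52, 0.10, 0.85, -0.20]  # reference sentiment scores
pred = [0.30, -0.40, 0.25, 0.60, -0.05]  # model predictions

print("MSE:", mean_squared_error(gold, pred))
print("MAE:", mean_absolute_error(gold, pred))
print("r2: ", r2_score(gold, pred))  # negative when predictions fit worse than the mean
```

A negative r², as reported for several models above, means the model's predictions explain less variance than simply predicting the mean reference score.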


Text Classification Task Results

Model  Banking77 (Accuracy, Precision, Recall, F1)  FinBench (Accuracy, Precision, Recall, F1)  FOMC (Accuracy, Precision, Recall, F1)  NumClaim (Accuracy, Precision, Recall, F1)  Headlines (Accuracy)
Llama 3 70B Instruct 0.660 0.748 0.660 0.645 0.222 0.826 0.222 0.309 0.661 0.662 0.661 0.652 0.430 0.240 0.980 0.386 0.811
Llama 3 8B Instruct 0.534 0.672 0.534 0.512 0.543 0.857 0.543 0.659 0.565 0.618 0.565 0.497 0.801 0.463 0.571 0.511 0.763
DBRX Instruct 0.578 0.706 0.578 0.574 0.359 0.851 0.359 0.483 0.285 0.572 0.285 0.193 0.222 0.190 1.000 0.319 0.746
DeepSeek LLM (67B) 0.596 0.711 0.596 0.578 0.369 0.856 0.369 0.492 0.532 0.678 0.532 0.407 0.832 1.000 0.082 0.151 0.778
Gemma 2 27B 0.639 0.730 0.639 0.621 0.410 0.849 0.410 0.538 0.651 0.704 0.651 0.620 0.471 0.257 1.000 0.408 0.808
Gemma 2 9B 0.630 0.710 0.630 0.609 0.412 0.848 0.412 0.541 0.595 0.694 0.595 0.519 0.371 0.224 0.990 0.365 0.856
Mistral (7B) Instruct v0.3 0.547 0.677 0.547 0.528 0.375 0.839 0.375 0.503 0.587 0.598 0.587 0.542 0.521 0.266 0.918 0.412 0.779
Mixtral-8x22B Instruct 0.622 0.718 0.622 0.602 0.166 0.811 0.166 0.221 0.562 0.709 0.562 0.465 0.732 0.384 0.775 0.513 0.835
Mixtral-8x7B Instruct 0.567 0.693 0.567 0.547 0.285 0.838 0.285 0.396 0.623 0.636 0.623 0.603 0.765 0.431 0.898 0.583 0.805
Qwen 2 Instruct (72B) 0.644 0.730 0.644 0.627 0.370 0.848 0.370 0.495 0.623 0.639 0.623 0.605 0.821 0.506 0.867 0.639 0.830
WizardLM-2 8x22B 0.664 0.737 0.664 0.648 0.373 0.842 0.373 0.500 0.583 0.710 0.583 0.505 0.831 0.630 0.173 0.272 0.797
DeepSeek-V3 0.722 0.774 0.722 0.714 0.362 0.845 0.362 0.487 0.625 0.712 0.625 0.578 0.860 0.586 0.796 0.675 0.729
DeepSeek R1 0.772 0.789 0.772 0.763 0.306 0.846 0.306 0.419 0.679 0.682 0.679 0.670 0.851 0.557 0.898 0.688 0.769
QwQ-32B-Preview 0.577 0.747 0.577 0.613 0.716 0.871 0.716 0.784 0.591 0.630 0.591 0.555 0.819 1.000 0.010 0.020 0.744
Jamba 1.5 Mini 0.528 0.630 0.528 0.508 0.913 0.883 0.913 0.898 0.572 0.678 0.572 0.499 0.812 0.429 0.092 0.151 0.682
Jamba 1.5 Large 0.642 0.746 0.642 0.628 0.494 0.851 0.494 0.618 0.597 0.650 0.597 0.550 0.855 0.639 0.469 0.541 0.782
Claude 3.5 Sonnet 0.682 0.755 0.682 0.668 0.513 0.854 0.513 0.634 0.675 0.677 0.675 0.674 0.879 0.646 0.745 0.692 0.827
Claude 3 Haiku 0.639 0.735 0.639 0.622 0.067 0.674 0.067 0.022 0.633 0.634 0.633 0.631 0.838 0.556 0.561 0.558 0.781
Cohere Command R 7B 0.530 0.650 0.530 0.516 0.682 0.868 0.682 0.762 0.536 0.505 0.536 0.459 0.797 0.210 0.041 0.068 0.770
Cohere Command R + 0.660 0.747 0.660 0.651 0.575 0.859 0.575 0.684 0.526 0.655 0.526 0.393 0.804 0.333 0.071 0.118 0.812
Google Gemini 1.5 Pro 0.483 0.487 0.483 0.418 0.240 0.823 0.240 0.336 0.619 0.667 0.619 0.579 0.700 0.369 0.908 0.525 0.837
OpenAI gpt-4o 0.704 0.792 0.704 0.710 0.396 0.846 0.396 0.524 0.681 0.719 0.681 0.664 0.896 0.667 0.857 0.750 0.824
OpenAI o1-mini 0.681 0.760 0.681 0.670 0.487 0.851 0.487 0.612 0.651 0.670 0.651 0.635 0.888 0.664 0.786 0.720 0.769


Text Summarization Task Results

Model  ECTSum (BERTScore Precision, Recall, F1)  EDTSum (BERTScore Precision, Recall, F1)
Llama 3 70B Instruct 0.715 0.801 0.754 0.793 0.844 0.817
Llama 3 8B Instruct 0.724 0.796 0.757 0.785 0.841 0.811
DBRX Instruct 0.680 0.786 0.729 0.774 0.843 0.806
DeepSeek LLM (67B) 0.692 0.678 0.681 0.779 0.840 0.807
Gemma 2 27B 0.680 0.777 0.723 0.801 0.829 0.814
Gemma 2 9B 0.651 0.531 0.585 0.803 0.833 0.817
Mistral (7B) Instruct v0.3 0.702 0.806 0.750 0.783 0.842 0.811
Mixtral-8x22B Instruct 0.713 0.812 0.758 0.790 0.843 0.815
Mixtral-8x7B Instruct 0.727 0.773 0.747 0.785 0.839 0.810
Qwen 2 Instruct (72B) 0.709 0.804 0.752 0.781 0.846 0.811
WizardLM-2 8x22B 0.677 0.806 0.735 0.774 0.847 0.808
DeepSeek-V3 0.703 0.806 0.750 0.791 0.842 0.815
DeepSeek R1 0.724 0.800 0.759 0.770 0.843 0.804
QwQ-32B-Preview 0.653 0.751 0.696 0.797 0.841 0.817
Jamba 1.5 Mini 0.692 0.798 0.741 0.798 0.838 0.816
Jamba 1.5 Large 0.679 0.800 0.734 0.799 0.841 0.818
Claude 3.5 Sonnet 0.737 0.802 0.767 0.786 0.843 0.813
Claude 3 Haiku 0.683 0.617 0.646 0.778 0.844 0.808
Cohere Command R 7B 0.724 0.781 0.750 0.790 0.844 0.815
Cohere Command R + 0.724 0.782 0.751 0.789 0.834 0.810
Google Gemini 1.5 Pro 0.757 0.800 0.777 0.800 0.836 0.817
OpenAI gpt-4o 0.755 0.793 0.773 0.795 0.840 0.816
OpenAI o1-mini 0.731 0.801 0.763 0.795 0.840 0.816
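ECTSum and EDTSum are scored with BERTScore, which compares candidate and reference summaries via contextual token embeddings. Below is a minimal sketch using the bert-score package; the language setting and example texts are assumptions, and FLaME's exact BERTScore configuration (underlying model, baseline rescaling, etc.) is not specified on this page.

```python
# Illustrative sketch only: hypothetical summaries scored with the bert-score package.
from bert_score import score

candidates = ["Revenue rose 12% on strong demand."]               # hypothetical system summary
references = ["The company reported a 12% increase in revenue."]  # hypothetical reference

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore  P={P.mean().item():.3f}  R={R.mean().item():.3f}  F1={F1.mean().item():.3f}")
```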
