Question Answering Task Results

Accuracy of each model on the FinQA, ConvFinQA, and TATQA datasets:

| Model | FinQA | ConvFinQA | TATQA |
|---|---|---|---|
| Llama 3 70B Instruct | 0.809 | 0.709 | 0.772 |
| Llama 3 8B Instruct | 0.767 | 0.268 | 0.706 |
| DBRX Instruct | 0.738 | 0.252 | 0.633 |
| DeepSeek LLM (67B) | 0.742 | 0.174 | 0.355 |
| Gemma 2 27B | 0.768 | 0.268 | 0.734 |
| Gemma 2 9B | 0.779 | 0.292 | 0.750 |
| Mistral (7B) Instruct v0.3 | 0.655 | 0.199 | 0.553 |
| Mixtral-8x22B Instruct | 0.766 | 0.285 | 0.666 |
| Mixtral-8x7B Instruct | 0.611 | 0.315 | 0.501 |
| Qwen 2 Instruct (72B) | 0.819 | 0.269 | 0.715 |
| WizardLM-2 8x22B | 0.796 | 0.247 | 0.725 |
| DeepSeek-V3 | 0.840 | 0.261 | 0.779 |
| DeepSeek R1 | 0.836 | **0.853** | **0.858** |
| QwQ-32B-Preview | 0.793 | 0.282 | 0.796 |
| Jamba 1.5 Mini | 0.666 | 0.218 | 0.586 |
| Jamba 1.5 Large | 0.790 | 0.225 | 0.660 |
| Claude 3.5 Sonnet | **0.844** | 0.402 | 0.700 |
| Claude 3 Haiku | 0.803 | 0.421 | 0.733 |
| Cohere Command R 7B | 0.709 | 0.212 | 0.716 |
| Cohere Command R + | 0.776 | 0.259 | 0.698 |
| Google Gemini 1.5 Pro | 0.829 | 0.280 | 0.763 |
| OpenAI gpt-4o | 0.836 | 0.749 | 0.754 |
| OpenAI o1-mini | 0.799 | 0.840 | 0.698 |

Note: The best score for each dataset is shown in bold.
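
The scores above are plain accuracies (fraction of questions answered correctly). The sketch below shows how such an exact-match accuracy could be computed from model outputs; the normalization rules and the example answers are illustrative assumptions, not the exact evaluation harness behind this table.

```python
# Hypothetical accuracy computation for numeric/text QA answers.
# The normalization below (case-folding, stripping commas/percent signs,
# rounding numbers) is an assumption for illustration only.
from typing import Iterable


def normalize_answer(ans: str) -> str:
    """Lowercase, strip whitespace, and canonicalize simple numeric answers."""
    ans = ans.strip().lower().replace(",", "").rstrip("%").rstrip(".")
    try:
        # Round numeric answers so that e.g. "0.50" and "0.5" compare equal.
        return f"{float(ans):.4f}"
    except ValueError:
        return ans


def accuracy(predictions: Iterable[str], references: Iterable[str]) -> float:
    """Fraction of predictions whose normalized form matches the reference."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    correct = sum(normalize_answer(p) == normalize_answer(r) for p, r in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy example with made-up answers, not data from FinQA/ConvFinQA/TATQA.
    preds = ["12.5%", "3,400", "increase"]
    golds = ["12.5", "3400.0", "Increase"]
    print(f"accuracy = {accuracy(preds, golds):.3f}")  # accuracy = 1.000
```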