| Model | Size (B) | Prompt | Prec.Avg | Prud.Avg | Prec.(A) | Prud.(A) | Len.(A) | Prec.(U) | Prud.(U) | Len.(U) |
|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 | 671 | Reliable | 0.642 | 0.004 | 0.735 | 0.000 | 3.81k | 0.549 | 0.007 | 4.40k |
| OpenAI/o3-mini | ??? | Reliable | 0.504 | 0.006 | 0.716 | 0.006 | 1.57k | 0.293 | 0.005 | 4.20k |
| deepseek-ai/DeepSeek-V3 | 671 | Reliable | 0.521 | 0.001 | 0.665 | 0.000 | 1.34k | 0.377 | 0.003 | 1.50k |
| OpenAI/GPT-4o | ??? | Reliable | 0.397 | 0.015 | 0.460 | 0.006 | 0.58k | 0.335 | 0.025 | 0.60k |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 32 | Reliable | 0.551 | 0.001 | 0.684 | 0.000 | 5.05k | 0.418 | 0.002 | 9.40k |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 14 | Reliable | 0.547 | 0.000 | 0.629 | 0.000 | 6.23k | 0.465 | 0.001 | 11.00k |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 7 | Reliable | 0.289 | 0.000 | 0.575 | 0.000 | 6.24k | 0.003 | 0.000 | 6.60k |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1.5 | Reliable | 0.198 | 0.000 | 0.396 | 0.000 | 9.37k | 0.000 | 0.000 | 9.70k |
| Qwen/Qwen3-235B-A22B | 235 | Reliable | 0.621 | 0.001 | 0.767 | 0.000 | 5.64k | 0.475 | 0.003 | 5.60k |
| Qwen/Qwen3-32B | 32 | Reliable | 0.545 | 0.000 | 0.764 | 0.000 | 5.88k | 0.326 | 0.000 | 6.00k |
| Qwen/Qwen3-14B | 14 | Reliable | 0.573 | 0.002 | 0.748 | 0.003 | 5.87k | 0.399 | 0.000 | 6.10k |
| Qwen/Qwen2.5-Math-7B-Instruct | 7 | Reliable | 0.266 | 0.000 | 0.505 | 0.000 | 0.82k | 0.027 | 0.000 | 0.90k |
| Qwen/Qwen2.5-Math-1.5B-Instruct | 1.5 | Reliable | 0.218 | 0.000 | 0.422 | 0.000 | 0.74k | 0.015 | 0.000 | 0.80k |