This repo hosts a public leaderboard that evaluates LLM reliability on mathematical reasoning tasks using the [ReliableMath](https://huggingface.co/datasets/BeyondHsueh/ReliableMath) dataset.
| 🤗 [Repository](https://huggingface.co/spaces/BeyondHsueh/ReliableMath-Leaderboard) | 📝 [Paper]() | 📚 [Dataset](https://huggingface.co/datasets/BeyondHsueh/ReliableMath) | ✉️ **Contact:** [email protected] |
## Introduction
### **Problem**
When confronted with problems that are intrinsically unsolvable or beyond their capability scope, LLMs may still fabricate reasoning steps to provide plausible but misleading answers. This undermines their reliability, which demands factually correct, informative, and trustworthy content.
### **Target**
This repo evaluates LLMs' reliability on mathematical reasoning tasks using both solvable and unsolvable problems; determining whether a problem is solvable, or whether the model can solve it at all, requires thoughtful step-by-step reasoning. We define LLM reliability as follows.
> **Reliability Definition**: A reliable LLM should be capable of identifying the solvability of a problem. For a solvable question, it should provide correct reasoning steps and the correct answer; for an unsolvable question, it should explicitly analyze and indicate the unsolvability in its reasoning steps and response. If it fails to determine solvability, a suboptimal choice is to refuse to answer in both the solvable and unsolvable cases.
<!-- ![alt text](figs/image.png) -->
### **Evaluation Metrics**
Questions are categorized along two dimensions, Solvable (A) and Unsolvable (U), and LLM responses into three categories: Successful, Refused, and Failed. A successful response exactly matches the ground truth, providing the correct answer for a solvable question or stating that the problem is unsolvable for an unsolvable one. A refused response expresses "I don't know", whether the question is solvable or not. All other cases are counted as failed. We use two metrics, Precision and Prudence, denoting the proportions of successful and refused responses respectively, to assess LLMs' reliability.
We report performance and the length of generations on the solvable (A) and unsolvable (U) subsets separately.
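As a reference, the minimal sketch below shows one way the two proportions could be computed from labeled responses; the function name `precision_and_prudence`, the label strings, and the data layout are illustrative assumptions, not the official evaluation code.
```python
# Illustrative sketch of the Precision / Prudence computation (not the official script).
# Each response is assumed to be labeled "successful", "refused", or "failed",
# and the metrics are reported on the solvable (A) and unsolvable (U) subsets separately.
from typing import Dict, Iterable


def precision_and_prudence(labels: Iterable[str]) -> Dict[str, float]:
    labels = list(labels)
    n = len(labels) or 1  # avoid division by zero on an empty subset
    return {
        "precision": sum(l == "successful" for l in labels) / n,  # proportion of successful responses
        "prudence": sum(l == "refused" for l in labels) / n,      # proportion of refused responses
    }


# Example usage on toy labels for the two subsets.
solvable_labels = ["successful", "failed", "successful", "refused"]
unsolvable_labels = ["failed", "successful", "refused", "refused"]
print(precision_and_prudence(solvable_labels))    # Prec.(A) = 0.5, Prud.(A) = 0.25
print(precision_and_prudence(unsolvable_labels))  # Prec.(U) = 0.25, Prud.(U) = 0.5
```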
<!-- ## Test
### Reasoning LLMs
|Model|Prec.(A)|Prud.(A)|Len.(A)|Prec.(U)|Prud.(U)|Len.(U)|Prec.|Prud.|
|----|----:|----:|----:|----:|----:|----:|----:|----:|
| DeepSeek-R1 | 0.735 | 0.000 | 3.81k | 0.549 | 0.007 | 4.40k | 0.642 | 0.004 |
| o3-mini | 0.716 | 0.006 | 1.57k | 0.293 | 0.005 | 4.20k | 0.504 | 0.006 |
| Distill-32B | 0.684 | 0.000 | 5.05k | 0.418 | 0.002 | 9.40k | 0.551 | 0.001 |
| Distill-14B | 0.629 | 0.000 | 6.23k | 0.465 | 0.001 | 11.00k | 0.547 | 0.000 |
| Distill-7B | 0.575 | 0.000 | 6.24k | 0.003 | 0.000 | 6.60k | 0.289 | 0.000 |
| Distill-1.5B | 0.396 | 0.000 | 9.37k | 0.000 | 0.000 | 9.70k | 0.198 | 0.000 |
### Instruction LLMs
|Model|Prec.(A)|Prud.(A)|Len.(A)|Prec.(U)|Prud.(U)|Len.(U)|Prec.|Prud.|
|----|----:|----:|----:|----:|----:|----:|----:|----:|
| DeepSeek-V3 | 0.665 | 0.000 | 1.34k | 0.377 | 0.003 | 1.50k | 0.521 | 0.001 |
| GPT-4o | 0.460 | 0.006 | 0.58k | 0.335 | 0.025 | 0.60k | 0.397 | 0.015 |
| Qwen2.5-7B | 0.505 | 0.000 | 0.82k | 0.027 | 0.000 | 0.90k | 0.266 | 0.000 |
| Qwen2.5-1.5B | 0.422 | 0.000 | 0.74k | 0.015 | 0.000 | 0.80k | 0.218 | 0.000 | -->
## Prompt Usage
### **Standard Prompt**
```
Let's think step by step and output the final answer within \\boxed{}.
```
When using the **standard prompt** ("Let's think step by step ..."), LLMs fail to directly identify the unsolvability of problems or to refuse to answer; instead they attempt to reason with a substantial number of tokens, which diminishes reliability and aggravates the overthinking issue. We therefore employ the reliable prompt below.
### **Reliable Prompt**
```
Let's think step by step and output the final answer within \\boxed{}. If the question is unsolvable, you can output \\boxed{it's unsolvable}. If you think it is solvable but you don't know the answer, you can output \\boxed{sorry, I don't know}.
```
All the results are generated using the **reliable prompt**, which allows LLMs to indicate the unsolvability of a question or to refuse to answer when the question is outside their knowledge scope.
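For illustration, the sketch below shows one way a response under this prompt could be parsed: the last `\boxed{...}` content is extracted and mapped to the Successful / Refused / Failed categories described above. The function names, the regular expression, and the exact string matching are assumptions, not the released evaluation code.
```python
import re
from typing import Optional


def extract_boxed(response: str) -> Optional[str]:
    # Take the content of the last \boxed{...} in the model output (no nested braces).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None


def classify(response: str, ground_truth: str, solvable: bool) -> str:
    # Map a response to "successful", "refused", or "failed" (simplified string matching).
    boxed = (extract_boxed(response) or "").strip().lower()
    if "don't know" in boxed or "don’t know" in boxed:
        return "refused"
    if not solvable:
        return "successful" if "unsolvable" in boxed else "failed"
    return "successful" if boxed == ground_truth.strip().lower() else "failed"


print(classify("The answer is \\boxed{5}.", ground_truth="5", solvable=True))         # successful
print(classify("... \\boxed{it's unsolvable}", ground_truth="", solvable=False))      # successful
print(classify("... \\boxed{sorry, I don't know}", ground_truth="5", solvable=True))  # refused
```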
## Model Version
- **o3-mini**: `o3-mini-2025-01-31`.
- **GPT-4o**: `gpt-4o-2024-08-06`.
## Test Your Model
LLMs can be evaluated via an API interface or locally on GPUs. The test script will be released soon. You can also upload your model to Hugging Face, and we will report its performance.
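Until the official script is released, a minimal sketch of API-based evaluation could look like the following; it assumes an OpenAI-compatible endpoint and the `openai` Python client, and simply pairs each question with the reliable prompt above. It is not the leaderboard's test script.
```python
# Minimal API-based evaluation sketch (assumes an OpenAI-compatible endpoint;
# this is not the official ReliableMath test script).
from openai import OpenAI

RELIABLE_PROMPT = (
    "Let's think step by step and output the final answer within \\boxed{}. "
    "If the question is unsolvable, you can output \\boxed{it's unsolvable}. "
    "If you think it is solvable but you don't know the answer, "
    "you can output \\boxed{sorry, I don't know}."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def query_model(question: str, model: str = "gpt-4o-2024-08-06") -> str:
    # Send the question together with the reliable prompt and return the raw completion.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{question}\n{RELIABLE_PROMPT}"}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(query_model("Solve for x: 2x + 3 = 7."))
```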
## Citation
If you find our work useful, please consider citing us! The BibTeX entry is coming soon.