Update space
app.py
CHANGED
@@ -137,6 +137,9 @@ with demo:
 
         DESCRIPTION_TEXT = """
         Total #models: 52 (Last updated: 2024-10-08)
+
+        This page provides a comprehensive overview of model ranks across various dimensions. Models are sorted based on their averaged rank across all dimensions.
+        (Some missing values are due to slow or problematic model responses, and we will update the leaderboard once we have the complete results.)
         """
         gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
 
@@ -160,8 +163,8 @@ with demo:
         with gr.TabItem("🎯 Overall", elem_id="llm-benchmark-tab-table", id=1):
             DESCRIPTION_TEXT = """
             Overall dimension measures the comprehensive performance of LLMs across diverse tasks.
-            We start with diverse questions from the widely-used [MT-Bench](https://arxiv.org/abs/2306.05685),
-
+            We start with diverse questions from the widely-used [MT-Bench](https://arxiv.org/abs/2306.05685),
+            covering a wide range of domains, including writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).
             """
             gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
 
@@ -192,6 +195,7 @@ with demo:
             [MathQA](https://arxiv.org/abs/1905.13319),
             [MathBench](https://arxiv.org/abs/2405.12209),
             [SciBench](https://arxiv.org/abs/2307.10635), and more!
+
             We plan to include more math domains, such as calculus, number theory, and more in the future.
             """
             gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
@@ -250,22 +254,24 @@ with demo:
 
         with gr.TabItem("🧠 Reasoning", elem_id="reasonong-tab-table", id=3):
             DESCRIPTION_TEXT = """
-            Reasoning is a broad domain for evaluating LLMs, but traditional tasks like commonsense reasoning have become less effective
-
+            Reasoning is a broad domain for evaluating LLMs, but traditional tasks like commonsense reasoning have become less effective in differentiating modern LLMs.
+            We now present two challenging types of reasoning: logical reasoning and social reasoning, both of which present more meaningful and sophisticated ways to assess LLM performance.
 
-            For logical reasoning, we
-            [
+            For logical reasoning, we leverage datasets from sources such as
+            [BIG-Bench Hard (BBH)](https://arxiv.org/abs/2210.09261),
             [FOLIO](https://arxiv.org/abs/2209.00840),
             [LogiQA2.0](https://github.com/csitfun/LogiQA2.0),
             [PrOntoQA](https://arxiv.org/abs/2210.01240),
-            [ReClor](https://arxiv.org/abs/2002.04326)
-
+            [ReClor](https://arxiv.org/abs/2002.04326).
+            These cover a range of tasks including deductive reasoning, object counting and tracking, pattern recognition,
+            temporal reasoning, first-order logic reasoning, etc.
             For social reasoning, we collect datasets from
-            [MMToM-QA](https://arxiv.org/abs/2401.08743),
+            [MMToM-QA (Text-only)](https://arxiv.org/abs/2401.08743),
             [BigToM](https://arxiv.org/abs/2306.15448),
             [Adv-CSFB](https://arxiv.org/abs/2305.14763),
             [SocialIQA](https://arxiv.org/abs/1904.09728),
-            [NormBank](https://arxiv.org/abs/2305.17008)
+            [NormBank](https://arxiv.org/abs/2305.17008), covering challenging social reasoning tasks,
+            such as social commonsense reasoning, social normative reasoning, Theory of Mind (ToM) reasoning, etc.
 
             """
             gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")