Update space
app.py
CHANGED
@@ -137,6 +137,9 @@ with demo:
 
         DESCRIPTION_TEXT = """
         Total #models: 52 (Last updated: 2024-10-08)
+
+        This page provides a comprehensive overview of model ranks across various dimensions. Models are sorted based on their averaged rank across all dimensions.
+        (Some missing values are due to slow or problematic model responses, and we will update the leaderboard once we have the complete results.)
         """
         gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
 
@@ -160,8 +163,8 @@ with demo:
         with gr.TabItem("🎯 Overall", elem_id="llm-benchmark-tab-table", id=1):
             DESCRIPTION_TEXT = """
             Overall dimension measures the comprehensive performance of LLMs across diverse tasks.
-            We start with diverse questions from the widely-used [MT-Bench](https://arxiv.org/abs/2306.05685),
-
+            We start with diverse questions from the widely-used [MT-Bench](https://arxiv.org/abs/2306.05685),
+            covering a wide range of domains, including writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).
             """
             gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
 
@@ -192,6 +195,7 @@ with demo:
             [MathQA](https://arxiv.org/abs/1905.13319),
             [MathBench](https://arxiv.org/abs/2405.12209),
             [SciBench](https://arxiv.org/abs/2307.10635), and more!
+
             We plan to include more math domains, such as calculus, number theory, and more in the future.
             """
             gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
@@ -250,22 +254,24 @@ with demo:
 
         with gr.TabItem("🧠 Reasoning", elem_id="reasonong-tab-table", id=3):
             DESCRIPTION_TEXT = """
-            Reasoning is a broad domain for evaluating LLMs, but traditional tasks like commonsense reasoning have become less effective
-
+            Reasoning is a broad domain for evaluating LLMs, but traditional tasks like commonsense reasoning have become less effective in differentiating modern LLMs.
+            We now present two challenging types of reasoning: logical reasoning and social reasoning, both of which present more meaningful and sophisticated ways to assess LLM performance.
 
-            For logical reasoning, we
-            [
+            For logical reasoning, we leverage datasets from sources such as
+            [BIG-Bench Hard (BBH)](https://arxiv.org/abs/2210.09261),
             [FOLIO](https://arxiv.org/abs/2209.00840),
             [LogiQA2.0](https://github.com/csitfun/LogiQA2.0),
             [PrOntoQA](https://arxiv.org/abs/2210.01240),
-            [ReClor](https://arxiv.org/abs/2002.04326)
-
+            [ReClor](https://arxiv.org/abs/2002.04326).
+            These cover a range of tasks including deductive reasoning, object counting and tracking, pattern recognition,
+            temporal reasoning, first-order logic reasoning, etc.
             For social reasoning, we collect datasets from
-            [MMToM-QA](https://arxiv.org/abs/2401.08743),
+            [MMToM-QA (Text-only)](https://arxiv.org/abs/2401.08743),
             [BigToM](https://arxiv.org/abs/2306.15448),
             [Adv-CSFB](https://arxiv.org/abs/2305.14763),
             [SocialIQA](https://arxiv.org/abs/1904.09728),
-            [NormBank](https://arxiv.org/abs/2305.17008)
+            [NormBank](https://arxiv.org/abs/2305.17008), covering challenging social reasoning tasks,
+            such as social commonsense reasoning, social normative reasoning, Theory of Mind (ToM) reasoning, etc.
 
             """
             gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")