Commit · 175efb2
1 Parent(s): f97f2b7
update about

Files changed:
- README.md  +1 -1
- src/display/about.py  +23 -111
- src/display/utils.py  +6 -6
README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: Chinese
+title: Open Chinese LLM Leaderboard
 emoji: 🏆
 colorFrom: green
 colorTo: indigo
src/display/about.py CHANGED

@@ -1,29 +1,37 @@
 from src.display.utils import ModelType

-TITLE = """<h1 align="center" id="space-title" […]
+TITLE = """<h1 align="center" id="space-title">Open Chinese LLM Leaderboard</h1>"""

 INTRODUCTION_TEXT = """
-[…]
-[…]
+Open Chinese LLM Leaderboard 旨在跟踪、排名和评估开放式中文大语言模型(LLM)。本排行榜由FlagEval平台提供相应算力和运行环境。
+评估数据集全部都是中文数据集,以评估中文能力。如需查看详情信息,请查阅‘关于’页面。
+如需对模型进行更全面的评测,可以登录FlagEval平台,体验更加完善的模型评测功能。
+
+The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). The leaderboard is powered by the [FlagEval](https://flageval.baai.ac.cn/) platform, which provides the computational resources and runtime environment.
+The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency. For more detailed information, please refer to the 'About' page.
+For a more comprehensive evaluation of your model, you can log in to the [FlagEval](https://flageval.baai.ac.cn/) platform to experience more refined model evaluation functionalities.

-🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
-The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - read more details in the "About" page!
 """

 LLM_BENCHMARKS_TEXT = f"""
 # Context
-[…]
+Open Chinese LLM Leaderboard是中文大语言模型排行榜。我们希望推动更加开放的生态,让中文大语言模型开发者参与进来,为中文大语言模型的进步做出相应的贡献。
+为了实现公平性的目标,所有模型都在 FlagEval 平台上使用标准化 GPU 和统一环境进行评估,以确保公平性。
+
+The Open Chinese LLM Leaderboard serves as a ranking platform for major Chinese language models. We aspire to foster a more open ecosystem, inviting developers of Chinese LLMs to contribute to the advancement of the field.
+In pursuit of fairness, all models are evaluated on the FlagEval platform using standardized GPUs and a uniform environment to ensure impartiality.

 ## How it works

 📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.

-- <a href="https://arxiv.org/abs/1803.05457" target="_blank"> […]
+- <a href="https://arxiv.org/abs/1803.05457" target="_blank"> ARC Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
-- <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
 - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is at minimum a 6-shot task, as 6 examples are systematically prepended, even when launched with 0 few-shot examples.
 - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
 - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
+- <a href="https://flageval.baai.ac.cn/#/taskIntro?t=zh_qa" target="_blank"> C-SEM </a> (5-shot) - a Chinese semantic understanding benchmark. Semantic understanding is a key cornerstone of NLP research and applications, yet publicly available benchmarks that evaluate large Chinese language models from a linguistic perspective are still lacking.
+- <a href="https://arxiv.org/abs/2306.09212" target="_blank"> CMMLU </a> (5-shot) - a comprehensive evaluation benchmark specifically designed to assess the knowledge and reasoning abilities of LLMs within the context of the Chinese language and culture. CMMLU covers a wide range of subjects, comprising 67 topics that span from elementary to advanced professional levels.

 For all these evaluations, a higher score is a better score.
 We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.

@@ -43,12 +51,13 @@ The total batch size we get for models which fit on one A100 node is 8 (8 GPUs * […]
 *You can expect results to vary slightly for different batch sizes because of padding.*

 The tasks and few-shot parameters are:
-- ARC: 25-shot, *arc-challenge* (`acc_norm`)
-- HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
-- TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
-- […]
-- […]
-- […]
+- C-ARC: 25-shot, *arc-challenge* (`acc_norm`)
+- C-HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
+- C-TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
+- C-Winogrande: 5-shot, *winogrande* (`acc`)
+- C-GSM8k: 5-shot, *gsm8k* (`acc`)
+- C-SEM-V2: 5-shot, *c-sem-v2* (`acc`)
+- CMMLU: 5-shot, *cmmlu* (`acc_norm`)

 Side note on the baseline scores:
 - for log-likelihood evaluation, we select the random baseline

@@ -63,14 +72,9 @@ If there is no icon, we have not uploaded the information on the model yet, feel […]

 "Flagged" indicates that this model has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model.

-## Quantization
-To get more information about quantization, see:
-- 8 bits: [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), [paper](https://arxiv.org/abs/2208.07339)
-- 4 bits: [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes), [paper](https://arxiv.org/abs/2305.14314)

 ## Useful links
 - [Community resources](https://huggingface.co/spaces/BAAI/open_cn_llm_leaderboard/discussions/174)
-- [Collection of best models](https://huggingface.co/collections/open-cn-llm-leaderboard/chinese-llm-leaderboard-best-models-65b0d4511dbd85fd0c3ad9cd)
 """

 FAQ_TEXT = """

@@ -170,96 +174,4 @@ If everything is done, check you can launch the EleutherAIHarness on your model […]

 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
-@misc{open-llm-leaderboard,
-    author = {Edward Beeching and Clémentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf},
-    title = {Open LLM Leaderboard},
-    year = {2023},
-    publisher = {Hugging Face},
-    howpublished = "\url{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}"
-}
-@software{eval-harness,
-    author = {Gao, Leo and
-              Tow, Jonathan and
-              Biderman, Stella and
-              Black, Sid and
-              DiPofi, Anthony and
-              Foster, Charles and
-              Golding, Laurence and
-              Hsu, Jeffrey and
-              McDonell, Kyle and
-              Muennighoff, Niklas and
-              Phang, Jason and
-              Reynolds, Laria and
-              Tang, Eric and
-              Thite, Anish and
-              Wang, Ben and
-              Wang, Kevin and
-              Zou, Andy},
-    title = {A framework for few-shot language model evaluation},
-    month = sep,
-    year = 2021,
-    publisher = {Zenodo},
-    version = {v0.0.1},
-    doi = {10.5281/zenodo.5371628},
-    url = {https://doi.org/10.5281/zenodo.5371628}
-}
-@misc{clark2018think,
-    title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
-    author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
-    year={2018},
-    eprint={1803.05457},
-    archivePrefix={arXiv},
-    primaryClass={cs.AI}
-}
-@misc{zellers2019hellaswag,
-    title={HellaSwag: Can a Machine Really Finish Your Sentence?},
-    author={Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
-    year={2019},
-    eprint={1905.07830},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-@misc{hendrycks2021measuring,
-    title={Measuring Massive Multitask Language Understanding},
-    author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
-    year={2021},
-    eprint={2009.03300},
-    archivePrefix={arXiv},
-    primaryClass={cs.CY}
-}
-@misc{lin2022truthfulqa,
-    title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
-    author={Stephanie Lin and Jacob Hilton and Owain Evans},
-    year={2022},
-    eprint={2109.07958},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-@misc{DBLP:journals/corr/abs-1907-10641,
-    title={{WINOGRANDE:} An Adversarial Winograd Schema Challenge at Scale},
-    author={Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi},
-    year={2019},
-    eprint={1907.10641},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-@misc{DBLP:journals/corr/abs-2110-14168,
-    title={Training Verifiers to Solve Math Word Problems},
-    author={Karl Cobbe and
-            Vineet Kosaraju and
-            Mohammad Bavarian and
-            Mark Chen and
-            Heewoo Jun and
-            Lukasz Kaiser and
-            Matthias Plappert and
-            Jerry Tworek and
-            Jacob Hilton and
-            Reiichiro Nakano and
-            Christopher Hesse and
-            John Schulman},
-    year={2021},
-    eprint={2110.14168},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
 """
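The "About" text above pairs each benchmark with a few-shot setting and a metric (for example C-ARC: 25-shot, scored with `acc_norm`). As a rough illustration of how one such setting maps onto the EleutherAI lm-evaluation-harness referenced in that text, a minimal sketch follows. It is not the leaderboard's actual runner: the task name `c_arc_challenge` comes from src/display/utils.py (see the diff below) and is assumed to be registered in the leaderboard's own evaluation backend rather than the stock harness, and the model identifier is purely hypothetical.

    # Minimal sketch, assuming v0.3-style harness argument names; not the leaderboard's runner.
    from lm_eval import evaluator

    results = evaluator.simple_evaluate(
        model="hf-causal",                  # HF causal-LM adapter; the name may differ across harness versions
        model_args="pretrained=org/model",  # hypothetical checkpoint id, for illustration only
        tasks=["c_arc_challenge"],          # C-ARC task name from src/display/utils.py (assumed registered)
        num_fewshot=25,                     # 25-shot, matching the C-ARC setting listed above
        batch_size=8,
    )
    print(results["results"])               # per-task metrics, e.g. acc_norm for C-ARC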
src/display/utils.py CHANGED

@@ -14,13 +14,13 @@ class Task:
     col_name: str

 class Tasks(Enum):
-    arc = Task("[…]
-    hellaswag = Task("[…]
-    truthfulqa = Task("[…]
-    winogrande = Task("[…]
-    gsm8k = Task("[…]
+    arc = Task("c_arc_challenge", "acc_norm", "C-ARC")
+    hellaswag = Task("c_hellaswag", "acc_norm", "C-HellaSwag")
+    truthfulqa = Task("c_truthfulqa_mc", "mc2", "C-TruthfulQA")
+    winogrande = Task("c_winogrande", "acc", "C-Winogrande")
+    gsm8k = Task("c_gsm8k", "acc", "C-GSM8K")
     c_sem = Task("c-sem-v2", "acc", "C-SEM")
-    mmlu = Task("cmmlu", "[…]
+    mmlu = Task("cmmlu", "acc_norm", "C-MMLU")

 # These classes are for user facing column names,
 # to avoid having to change them all around the code