File size: 1,777 Bytes
7ae1238
 
8c31abc
7ae1238
 
8c31abc
7ae1238
8c31abc
7ae1238
 
 
8c31abc
 
 
 
 
 
 
 
 
7ae1238
 
 
8c31abc
7ae1238
 
 
8c31abc
ee6bd36
 
 
 
7ae1238
8c31abc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34


# Raw HTML heading rendered at the top of the leaderboard page.
TITLE = """<h1 align="center" id="space-title"> KG LLM Leaderboard</h1>"""

# Intro blurb shown under the title. Plain string: the original f-string
# prefix was dropped because the literal contains no placeholders (F541).
INTRODUCTION_TEXT = """
🐨 KG LLM Leaderboard aims to track, rank, and evaluate the performance of released Large Language Models on traditional KBQA/KGQA datasets.

The data on this page is sourced from a research paper. If you intend to use the data from this page, please remember to cite the following source: https://arxiv.org/abs/2303.07992
"""

# "About" text describing the evaluation methodology. Plain string: the
# original f-string prefix was dropped because there are no placeholders (F541).
LLM_BENCHMARKS_TEXT = """
ChatGPT is a powerful large language model (LLM) that
covers knowledge resources such as Wikipedia and supports natural language question answering using its own knowledge. Therefore, there is
growing interest in exploring whether ChatGPT can replace traditional
knowledge-based question answering (KBQA) models. Although there
have been some works analyzing the question answering performance of
ChatGPT, there is still a lack of large-scale, comprehensive testing of various types of complex questions to analyze the limitations of the model.
In this paper, we present a framework that follows the black-box testing specifications of CheckList proposed by Microsoft. We evaluate ChatGPT
and its family of LLMs on eight real-world KB-based complex question answering datasets, which include six English datasets and two multilingual datasets.
The total number of test cases is approximately 190,000.

"""



# Label for the UI button that copies the citation snippet below.
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
# BibTeX entry for the source paper. Raw string so any BibTeX backslash
# commands added later are preserved verbatim.
CITATION_BUTTON_TEXT = r"""
@article{tan2023evaluation,
    title={Evaluation of ChatGPT as a question answering system for answering complex questions},
    author={Tan, Yiming and Min, Dehai and Li, Yu and Li, Wenbo and Hu, Nan and Chen, Yongrui and Qi, Guilin},
    journal={arXiv preprint arXiv:2303.07992},
    year={2023}
}
"""