arxiv:2406.12644

Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models

Published on Jun 18, 2024 · Submitted by amanchadha on Jun 19, 2024

Abstract

Assessing the effectiveness of large language models (LLMs) in addressing diverse tasks is essential for comprehending their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, not considering the varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies, arranged from the simplest to the most complex, to assess LLMs more precisely and to offer a clearer perspective. This taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to datasets as well as LLMs based on the rules of the taxonomy, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task. This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This paper leads to the development of a universal evaluation metric that can be used to evaluate both the complexity of the datasets and the capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is publicly available.

Community


The paper introduces the Hierarchical Prompting Taxonomy (HPT) and an adaptive Hierarchical Prompt Framework (adaptive HPF) to evaluate large language models (LLMs) automatically and according to task complexity.

  • Hierarchical Prompting Taxonomy (HPT): Defines a set of rules, built on five prompting strategies ordered from simplest to most complex, that yields a universal measure of task complexity for datasets and of capability for LLMs.
  • Adaptive Hierarchical Prompt Framework (adaptive HPF): Selects an appropriate prompting strategy for each task automatically, so the evaluation no longer depends on a manually chosen strategy.
  • Experimental Validation: Evaluates four instruction-tuned LLMs across four datasets, demonstrating that HPT and HPF provide reliable and nuanced assessments of LLM capabilities.
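
To make the idea of ordered prompting strategies concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the strategy names `level_1` to `level_5` are placeholders (the abstract does not name the five strategies), `is_solved` is a hypothetical callback standing in for "prompt the model with this strategy and check the answer", and the rule for samples that no strategy solves (`penalty`) is assumed rather than taken from the paper. The mean level it computes is only a stand-in for the HP-Score, whose exact rules are defined in the paper.

```python
from typing import Callable, Sequence

# Placeholder names: the abstract only says there are five strategies,
# ordered from simplest to most complex.
STRATEGIES = ["level_1", "level_2", "level_3", "level_4", "level_5"]


def first_successful_level(
    sample: str,
    is_solved: Callable[[str, str], bool],
    strategies: Sequence[str] = STRATEGIES,
    penalty: int = 1,
) -> int:
    """Return the 1-based level of the first strategy that solves the sample.

    If no strategy succeeds, return len(strategies) + penalty. This fallback
    is an assumption made for the sketch.
    """
    for level, strategy in enumerate(strategies, start=1):
        if is_solved(sample, strategy):
            return level
    return len(strategies) + penalty


def mean_level(
    samples: Sequence[str],
    is_solved: Callable[[str, str], bool],
    strategies: Sequence[str] = STRATEGIES,
) -> float:
    """Average the per-sample levels; lower means simpler prompts sufficed."""
    if not samples:
        raise ValueError("samples must not be empty")
    levels = [first_successful_level(s, is_solved, strategies) for s in samples]
    return sum(levels) / len(levels)


def toy_is_solved(sample: str, strategy: str) -> bool:
    """Toy stand-in for a model call: each sample needs a minimum level."""
    required = {"easy": 1, "medium": 3, "hard": 6}[sample]
    return STRATEGIES.index(strategy) + 1 >= required


if __name__ == "__main__":
    print(mean_level(["easy", "medium", "hard"], toy_is_solved))
```

Under these assumptions, a lower mean level for a fixed model suggests an easier dataset, and a lower mean level on a fixed dataset suggests a more capable model, which matches the dual use of the score described in the abstract.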
