rzanoli committed · Commit 602e1b0 · 1 Parent(s): 7aacef3

Small Changes

Files changed (2):
  1. src/about.py (+2 −2)
  2. src/tasks.py (+10 −10)
src/about.py CHANGED
@@ -93,8 +93,8 @@ TITLE = """<h1 align="center" id="space-title">🚀 EVALITA-LLM Leaderboard 🚀
 INTRODUCTION_TEXT = """
 Evalita-LLM is a benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing features of Evalita-LLM are the following: (i) **all tasks are native Italian**, avoiding translation issues and potential cultural biases; (ii) the benchmark includes **generative** tasks, enabling more natural interaction with LLMs; (iii) **all tasks are evaluated against multiple prompts**, this way mitigating the model sensitivity to specific prompts and allowing a fairer evaluation.
 
-**<small>Multiple Choice:</small>** <small> 📊TE (Textual Entailment), 😃SA (Sentiment Analysis), ⚠️HS (Hate Speech Detection), 🏥AT (Admission Test), 🔤WIC (Word in Context), ❓FAQ (Frequently Asked Questions) </small><br>
-**<small>Generative:</small>** <small>🔄LS (Lexical Substitution), 📝SU (Summarization), 🏷️NER (Named Entity Recognition), 🔗REL (Relation Extraction) </small>
+**<small>Multiple-choice tasks:</small>** <small> 📊TE (Textual Entailment), 😃SA (Sentiment Analysis), ⚠️HS (Hate Speech Detection), 🏥AT (Admission Test), 🔤WIC (Word in Context), ❓FAQ (Frequently Asked Questions) </small><br>
+**<small>Generative tasks:</small>** <small>🔄LS (Lexical Substitution), 📝SU (Summarization), 🏷️NER (Named Entity Recognition), 🔗REL (Relation Extraction) </small>
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
src/tasks.py CHANGED
@@ -23,7 +23,7 @@ Evalita-LLM is a benchmark designed to evaluate Large Language Models (LLMs) on
 MEASURE_DESCRIPTION = "**Combined Performance** = (1 - (**Best Prompt** - **Prompt Average**) / 100) * **Best Prompt**. **Prompt Average** = accuracy averaged over the assessed prompts. **Best Prompt** = accuracy of the best prompt. **Prompt ID** = ID of the best prompt (see legend above)."
 
 # Tasks Descriptions
-TE_DESCRIPTION = """### Textual Entailment (TE) *(Multiple Choice)*
+TE_DESCRIPTION = """### Textual Entailment (TE) --- *Multiple-choice task*
 The input are two sentences: the text (T) and the hypothesis (H). The model has to determine whether the meaning of the hypothesis is logically entailed by the text.
 
 | # | Prompt | Answer Choices |
@@ -39,7 +39,7 @@ TE_DESCRIPTION = """### Textual Entailment (TE) *(Multiple Choice)*
 
 """
 
-SA_DESCRIPTION = """### Sentiment Analysis (SA) *(Multiple Choice)*
+SA_DESCRIPTION = """### Sentiment Analysis (SA) --- *Multiple-choice task*
 The input is a tweet. The model has to determine the sentiment polarity of the text, categorizing it into one of four classes: positive, negative, neutral, or mixed.
 
 | # | Prompt | Answer Choices |
@@ -55,7 +55,7 @@ SA_DESCRIPTION = """### Sentiment Analysis (SA) *(Multiple Choice)*
 
 """
 
-HS_DESCRIPTION = """### Hate Speech (HS) *(Multiple Choice)*
+HS_DESCRIPTION = """### Hate Speech (HS) --- *Multiple-choice task*
 The input is a tweet. The model has to determine whether the text contains hateful content directed towards marginalized or minority groups. The output is a binary classification: hateful or not hateful.
 
 | # | Prompt | Answer Choices |
@@ -71,7 +71,7 @@ HS_DESCRIPTION = """### Hate Speech (HS) *(Multiple Choice)*
 
 """
 
-AT_DESCRIPTION = """### Admission Tests (AT) *(Multiple Choice)*
+AT_DESCRIPTION = """### Admission Tests (AT) --- *Multiple-choice task*
 The input is a multiple-choice question with five options (A-E) from Italian medical specialty entrance exams, and the model must identify the correct answer.
 
 | # | Prompt | Answer Choices |
@@ -87,7 +87,7 @@ AT_DESCRIPTION = """### Admission Tests (AT) *(Multiple Choice)*
 
 """
 
-WIC_DESCRIPTION = """### Word in Context (WIC) *(Multiple Choice)*
+WIC_DESCRIPTION = """### Word in Context (WIC) --- *Multiple-choice task*
 The input consists of a word (w) and two sentences. The model has to determine whether the word w has the same meaning in both sentences. The output is a binary classification: 1 (same meaning) or 0 (different meaning).
 
 | # | Prompt | Answer Choices |
@@ -103,7 +103,7 @@ WIC_DESCRIPTION = """### Word in Context (WIC) *(Multiple Choice)*
 
 """
 
-FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) *(Multiple Choice)*
+FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) --- *Multiple-choice task*
 The input is a user query regarding the water supply service. The model must identify the correct answer from the 4 available options.
 
 | # | Prompt | Answer Choices |
@@ -119,7 +119,7 @@ FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) *
 
 """
 
-LS_DESCRIPTION = """### Lexical Substitution (LS) *(Generative)*
+LS_DESCRIPTION = """### Lexical Substitution (LS) --- *Generative task*
 The input is a sentence containing a target word (w). The model has to replace the target word w with its most suitable synonyms that are contextually relevant.
 
 | # | Prompt |
@@ -131,7 +131,7 @@ LS_DESCRIPTION = """### Lexical Substitution (LS) *(Generative)*
 
 """
 
-SU_DESCRIPTION = """### Summarization (SUM) *(Generative)*
+SU_DESCRIPTION = """### Summarization (SUM) --- *Generative task*
 The input is a news article. The model has to generate a concise summary of the input text, capturing the key information and main points.
 
 | # | Prompt |
@@ -143,7 +143,7 @@ SU_DESCRIPTION = """### Summarization (SUM) *(Generative)*
 
 """
 
-NER_DESCRIPTION = """### Named Entity Recognition (NER) *(Generative)*
+NER_DESCRIPTION = """### Named Entity Recognition (NER) --- *Generative task*
 The input is a sentence. The model has to identify and classify Named Entities into predefined categories such as person, organization, and location.
 
 | # | Prompt |
@@ -155,7 +155,7 @@ NER_DESCRIPTION = """### Named Entity Recognition (NER) *(Generative)*
 
 """
 
-REL_DESCRIPTION = """### Relation Extraction (REL) *(Generative)*
+REL_DESCRIPTION = """### Relation Extraction (REL) --- *Generative task*
 The input is a sentence of a clinical text. The model must identify and extract relationships between laboratory test results (e.g., blood pressure) and the corresponding tests or procedures that generated them (e.g., blood pressure test).
 
 | # | Prompt |
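
The MEASURE_DESCRIPTION string in src/tasks.py defines the leaderboard's Combined Performance metric. A minimal sketch of that formula as Python (the function name is illustrative; it is not claimed to exist in the repository) might look like:

```python
def combined_performance(best_prompt: float, prompt_average: float) -> float:
    """Combined Performance per MEASURE_DESCRIPTION:
    (1 - (Best Prompt - Prompt Average) / 100) * Best Prompt.
    Both accuracies are assumed to be percentages in [0, 100]."""
    return (1 - (best_prompt - prompt_average) / 100) * best_prompt

# A model whose best prompt scores 80% but averages 70% across prompts
# is penalized for prompt sensitivity: (1 - 10/100) * 80 = 72.0
print(combined_performance(80.0, 70.0))  # 72.0
```

Note how the metric rewards robustness: when all prompts score the same (Best Prompt == Prompt Average), Combined Performance equals Best Prompt; the larger the gap, the larger the discount.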