Small Changes
- src/about.py +2 -2
- src/tasks.py +10 -10
src/about.py CHANGED

@@ -93,8 +93,8 @@ TITLE = """<h1 align="center" id="space-title">🚀 EVALITA-LLM Leaderboard 🚀
 INTRODUCTION_TEXT = """
 Evalita-LLM is a benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing features of Evalita-LLM are the following: (i) **all tasks are native Italian**, avoiding translation issues and potential cultural biases; (ii) the benchmark includes **generative** tasks, enabling more natural interaction with LLMs; (iii) **all tasks are evaluated against multiple prompts**, this way mitigating the model sensitivity to specific prompts and allowing a fairer evaluation.
 
-**<small>Multiple-choice:</small>** <small> 📊TE (Textual Entailment), 😃SA (Sentiment Analysis), ⚠️HS (Hate Speech Detection), 🏥AT (Admission Test), 🔤WIC (Word in Context), ❓FAQ (Frequently Asked Questions) </small><br>
-**<small>Generative:</small>** <small>🔄LS (Lexical Substitution), 📝SU (Summarization), 🏷️NER (Named Entity Recognition), 🔗REL (Relation Extraction) </small>
+**<small>Multiple-choice tasks:</small>** <small> 📊TE (Textual Entailment), 😃SA (Sentiment Analysis), ⚠️HS (Hate Speech Detection), 🏥AT (Admission Test), 🔤WIC (Word in Context), ❓FAQ (Frequently Asked Questions) </small><br>
+**<small>Generative tasks:</small>** <small>🔄LS (Lexical Substitution), 📝SU (Summarization), 🏷️NER (Named Entity Recognition), 🔗REL (Relation Extraction) </small>
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
src/tasks.py CHANGED

@@ -23,7 +23,7 @@ Evalita-LLM is a benchmark designed to evaluate Large Language Models (LLMs) on
 MEASURE_DESCRIPTION = "**Combined Performance** = (1 - (**Best Prompt** - **Prompt Average**) / 100) * **Best Prompt**. **Prompt Average** = accuracy averaged over the assessed prompts. **Best Prompt** = accuracy of the best prompt. **Prompt ID** = ID of the best prompt (see legend above)."
 
 # Tasks Descriptions
-TE_DESCRIPTION = """### Textual Entailment (TE) *(Multiple Choice)*
+TE_DESCRIPTION = """### Textual Entailment (TE) --- *Multiple-choice task*
 The input are two sentences: the text (T) and the hypothesis (H). The model has to determine whether the meaning of the hypothesis is logically entailed by the text.
 
 | # | Prompt | Answer Choices |

@@ -39,7 +39,7 @@ TE_DESCRIPTION = """### Textual Entailment (TE) *(Multiple Choice)*
 
 """
 
-SA_DESCRIPTION = """### Sentiment Analysis (SA) *(Multiple Choice)*
+SA_DESCRIPTION = """### Sentiment Analysis (SA) --- *Multiple-choice task*
 The input is a tweet. The model has to determine the sentiment polarity of the text, categorizing it into one of four classes: positive, negative, neutral, or mixed.
 
 | # | Prompt | Answer Choices |

@@ -55,7 +55,7 @@ SA_DESCRIPTION = """### Sentiment Analysis (SA) *(Multiple Choice)*
 
 """
 
-HS_DESCRIPTION = """### Hate Speech (HS) *(Multiple Choice)*
+HS_DESCRIPTION = """### Hate Speech (HS) --- *Multiple-choice task*
 The input is a tweet. The model has to determine whether the text contains hateful content directed towards marginalized or minority groups. The output is a binary classification: hateful or not hateful.
 
 | # | Prompt | Answer Choices |

@@ -71,7 +71,7 @@ HS_DESCRIPTION = """### Hate Speech (HS) *(Multiple Choice)*
 
 """
 
-AT_DESCRIPTION = """### Admission Tests (AT) *(Multiple Choice)*
+AT_DESCRIPTION = """### Admission Tests (AT) --- *Multiple-choice task*
 The input is a multiple-choice question with five options (A-E) from Italian medical specialty entrance exams, and the model must identify the correct answer.
 
 | # | Prompt | Answer Choices |

@@ -87,7 +87,7 @@ AT_DESCRIPTION = """### Admission Tests (AT) *(Multiple Choice)*
 
 """
 
-WIC_DESCRIPTION = """### Word in Context (WIC) *(Multiple Choice)*
+WIC_DESCRIPTION = """### Word in Context (WIC) --- *Multiple-choice task*
 The input consists of a word (w) and two sentences. The model has to determine whether the word w has the same meaning in both sentences. The output is a binary classification: 1 (same meaning) or 0 (different meaning).
 
 | # | Prompt | Answer Choices |

@@ -103,7 +103,7 @@ WIC_DESCRIPTION = """### Word in Context (WIC) *(Multiple Choice)*
 
 """
 
-FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) *(Multiple Choice)*
+FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) --- *Multiple-choice task*
 The input is a user query regarding the water supply service. The model must identify the correct answer from the 4 available options.
 
 | # | Prompt | Answer Choices |

@@ -119,7 +119,7 @@ FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) *(Multiple Choice)*
 
 """
 
-LS_DESCRIPTION = """### Lexical Substitution (LS) *(Generative)*
+LS_DESCRIPTION = """### Lexical Substitution (LS) --- *Generative task*
 The input is a sentence containing a target word (w). The model has to replace the target word w with its most suitable synonyms that are contextually relevant.
 
 | # | Prompt |

@@ -131,7 +131,7 @@ LS_DESCRIPTION = """### Lexical Substitution (LS) *(Generative)*
 
 """
 
-SU_DESCRIPTION = """### Summarization (SUM) *(Generative)*
+SU_DESCRIPTION = """### Summarization (SUM) --- *Generative task*
 The input is a news article. The model has to generate a concise summary of the input text, capturing the key information and main points.
 
 | # | Prompt |

@@ -143,7 +143,7 @@ SU_DESCRIPTION = """### Summarization (SUM) *(Generative)*
 
 """
 
-NER_DESCRIPTION = """### Named Entity Recognition (NER) *(Generative)*
+NER_DESCRIPTION = """### Named Entity Recognition (NER) --- *Generative task*
 The input is a sentence. The model has to identify and classify Named Entities into predefined categories such as person, organization, and location.
 
 | # | Prompt |

@@ -155,7 +155,7 @@ NER_DESCRIPTION = """### Named Entity Recognition (NER) *(Generative)*
 
 """
 
-REL_DESCRIPTION = """### Relation Extraction (REL) *(Generative)*
+REL_DESCRIPTION = """### Relation Extraction (REL) --- *Generative task*
 The input is a sentence of a clinical text. The model must identify and extract relationships between laboratory test results (e.g., blood pressure) and the corresponding tests or procedures that generated them (e.g., blood pressure test).
 
 | # | Prompt |
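The MEASURE_DESCRIPTION string in src/tasks.py defines the leaderboard's headline metric as a formula. A minimal sketch of that computation, assuming accuracies on a 0-100 scale (the function name `combined_performance` is illustrative, not part of the repository):

```python
def combined_performance(best_prompt: float, prompt_average: float) -> float:
    """Combined Performance per MEASURE_DESCRIPTION:
    (1 - (Best Prompt - Prompt Average) / 100) * Best Prompt.
    Both inputs are accuracies on a 0-100 scale; the spread between the
    best prompt and the average acts as a penalty factor."""
    return (1 - (best_prompt - prompt_average) / 100) * best_prompt

# A model whose best prompt scores 80 but whose prompts average 70
# is penalized for the 10-point spread:
print(combined_performance(80.0, 70.0))  # 72.0
```

Note the design intent the formula encodes: a model that scores identically across prompts keeps its full best-prompt accuracy, while prompt-sensitive models are discounted in proportion to the gap.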