Update content.py
content.py CHANGED (+56 -0)
@@ -67,6 +67,62 @@ Are you sure you want to submit your model?
ABOUT_MARKDOWN = """
# About

## Abstract

We present **B**en**C**zech**M**ark (BCM), the first multitask and multimetric Czech language benchmark for large language models, with a unique scoring system that utilizes the theory of statistical significance. Our benchmark covers 54 challenging, mostly native Czech tasks spanning 11 categories, including diverse domains such as historical Czech, pupil and language-learner essays, and spoken word.

Furthermore, we collect and clean the [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/BUT-LCC), the largest publicly available clean Czech language corpus, and continuously pretrain the first Czech-centric 7B language model, [CSMPT7B](https://huggingface.co/BUT-FIT/csmpt7b), with Czech-specific tokenization. We use our model as a baseline for comparison with publicly available multilingual models.

## Methodology

While we will reveal more details in our upcoming work, here is how the leaderboard ranking works in a nutshell.

### Prompting Mechanism

Each task (except for tasks from the language modelling category) is composed of 5 or more prompts. The performance of every model is then max-pooled over these prompts (the best-performing prompt counts), as in the sketch below.
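Concretely, the aggregation could look like the following minimal Python sketch; `pool_task_score` and the `prompt_scores` mapping are illustrative names, not the leaderboard's actual code:

```python
# Illustrative sketch of the per-task prompt aggregation (not the leaderboard's actual code).
# `prompt_scores` maps a prompt id to the metric value obtained with that prompt
# for one (model, task) pair.

def pool_task_score(prompt_scores: dict[str, float]) -> float:
    """Task score = the best score achieved over all prompt variants (max-pooling)."""
    return max(prompt_scores.values())

# Example: one task evaluated with five prompt templates.
print(pool_task_score({"p1": 0.61, "p2": 0.58, "p3": 0.66, "p4": 0.60, "p5": 0.63}))  # 0.66
```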

### Metrics and Significance Testing

We use the following metrics for the following task types:

- Fixed-class Classification: average area under the curve (one-vs-all average)
- Multichoice Classification: accuracy
- Question Answering: exact match
- Summarization: ROUGE-RAW (2-gram)
- Language Modeling: word-level perplexity (sketched below)
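For the language-modelling metric, here is a hedged sketch of word-level perplexity computed from summed token log-probabilities, assuming normalization by the number of words rather than tokens; the function and variable names are illustrative, not the benchmark's exact implementation:

```python
import math

# Illustrative word-level perplexity: exponentiated average negative log-likelihood
# per *word*, independent of how many tokens the model's tokenizer produced.
def word_level_perplexity(total_log_prob: float, num_words: int) -> float:
    return math.exp(-total_log_prob / num_words)

# Example: a text of 1,000 words whose summed token log-likelihood is -4,200 nats.
print(word_level_perplexity(-4200.0, 1000))  # ~66.7
```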

On every task and for every metric, we run a test for statistical significance at α = 0.05, i.e., the probability that the performance of model A equals the performance of model B is estimated to be less than 0.05.

We use the following tests, with varying statistical power:

- accuracy and exact match: one-tailed paired t-test (see the sketch below),
- average area under the curve: a Bayesian test inspired by [Goutte et al., 2005](https://link.springer.com/chapter/10.1007/978-3-540-31865-1_25),
- summarization & perplexity: bootstrapping.
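As an illustration of the first test, a duel on an accuracy- or exact-match-scored task could be decided roughly as in the following SciPy-based sketch (a simplified, assumption-laden example, not the benchmark's exact implementation):

```python
import numpy as np
from scipy.stats import ttest_rel

def model_a_wins_duel(scores_a: np.ndarray, scores_b: np.ndarray, alpha: float = 0.05) -> bool:
    """One-tailed paired t-test: is model A's per-example score significantly higher than B's?"""
    # scores_* hold per-example correctness (0/1 for accuracy / exact match) on the same examples.
    _, p_value = ttest_rel(scores_a, scores_b, alternative="greater")
    return p_value < alpha

# Example: 1,000 shared test examples.
rng = np.random.default_rng(0)
scores_a = (rng.random(1000) < 0.72).astype(float)  # model A is correct ~72% of the time
scores_b = (rng.random(1000) < 0.65).astype(float)  # model B is correct ~65% of the time
print(model_a_wins_duel(scores_a, scores_b))
```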

### Duel Scoring Mechanism, Win Score

On each task, each model is compared in a duel against every other model (up to the top 50 currently submitted models). For each model, we record the proportion of duels it wins: its **Win Score** (WS).

Next, the **Category Win Score** (CWS) is computed as the average of a model's WSs over the tasks in that category. Similarly, the 🇨🇿 **BenCzechMark Win Score** is computed as a model's average CWS across categories.
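A hedged sketch of how these aggregates could be computed from duel outcomes follows; the data layout and function names are assumptions for illustration only:

```python
# Illustrative aggregation of duel outcomes into WS, CWS and the overall Win Score
# (data layout and names are assumptions, not the leaderboard's actual code).

def win_score(duels_won: int, n_opponents: int) -> float:
    """WS on one task: proportion of duels the model won."""
    return duels_won / n_opponents

def category_win_score(task_win_scores: list[float]) -> float:
    """CWS: average of a model's per-task WSs within one category."""
    return sum(task_win_scores) / len(task_win_scores)

def benczechmark_win_score(category_win_scores: list[float]) -> float:
    """Overall score: a model's average CWS across all categories."""
    return sum(category_win_scores) / len(category_win_scores)

# Example: one category with three tasks, each duelled against 49 opponents.
task_ws = [win_score(w, 49) for w in (30, 41, 25)]
cws = category_win_score(task_ws)                      # ~0.65
print(cws, benczechmark_win_score([cws, 0.52, 0.70]))
```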

The properties of this ranking mechanism include:

- Ranking can change after every submission.
- The across-task aggregation is interpretable: in words, it measures the average proportion of duels in which the model is better.
- It allows utilizing a wide spectrum of existing resources, evaluated under different metrics.

## Baseline Setup

The models submitted to the leaderboard by the authors were evaluated with the following setup, summarized in the sketch after this list:

- max input length: 2048 tokens
- number of shown examples (few-shot mechanism): 3-shot
- truncation: smart truncation
- log-probability aggregation: average-pooling
- chat templates: not used
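The same settings restated as a hypothetical configuration dictionary; the key names are illustrative and are not the evaluation harness's actual parameters:

```python
# Hypothetical summary of the baseline settings listed above (illustrative key names only).
BASELINE_EVAL_CONFIG = {
    "max_input_length": 2048,       # tokens
    "num_fewshot": 3,               # examples shown in the prompt
    "truncation": "smart",          # smart truncation, as listed above
    "logprob_aggregation": "mean",  # average-pooling of token log-probabilities
    "use_chat_template": False,
}
```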

## Citation

You can use the following citation for this leaderboard and our upcoming work.

```bibtex
@article{fajcik2024benczechmark,
  title       = {{B}en{C}zech{M}ark: A Czech-centric Multitask and Multimetric Benchmark for Language Models with Duel Scoring Mechanism},
  author      = {Martin Fajcik and Martin Docekal and Jan Dolezal and Karel Ondrej and Karel Benes and Jan Kapsa and Michal Hradis and Zuzana Neverilova and Ales Horak and Michal Stefanik and Adam Jirkovsky and David Adamczyk and Jan Hula and Jan Sedivy and Hynek Kydlicek},
  year        = {2024},
  url         = {https://huggingface.co/spaces/CZLC/BenCzechMark},
  institution = {Brno University of Technology, Masaryk University, Czech Technical University in Prague, Hugging Face},
}
```

## Authors & Correspondence

- **BenCzechMark Authors & Contributors:**
- **BUT FIT**
- Martin Fajčík