mfajcik committed on
Commit
43aa8ea
1 Parent(s): 3b741bb

Update content.py

Files changed (1): content.py (+56 −0)
content.py CHANGED
@@ -67,6 +67,62 @@ Are you sure you want to submit your model?
 
  ABOUT_MARKDOWN = """
  # About
+ ## Abstract
+ We present **B**en**C**zech**M**ark (BCM), the first multitask and multimetric Czech language benchmark for large language models, with a unique scoring system that utilizes the theory of statistical significance. Our benchmark covers 54 challenging, mostly native Czech tasks spanning 11 categories, including diverse domains such as historical Czech, pupil and language-learner essays, and spoken word.
+
+ Furthermore, we collect and clean the [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/BUT-LCC), the largest publicly available clean Czech language corpus, and continuously pretrain the first Czech-centric 7B language model [CSMPT7B](https://huggingface.co/BUT-FIT/csmpt7b), with Czech-specific tokenization. We use our model as a baseline for comparison with publicly available multilingual models.
+
+ ## Methodology
+ While we will reveal more details in our upcoming work, here is how the leaderboard ranking works in a nutshell.
+
+ ### Prompting Mechanism
+ Each task (except for tasks from the language modelling category) is composed of 5 or more prompts. The performance of every model on a task is then max-pooled over its prompts (the best-performing prompt counts).
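The max-pooling step above can be sketched as follows (function and prompt names are illustrative, not taken from the BCM codebase):

```python
def max_pool_over_prompts(prompt_scores):
    """Task performance for one model: the best score over its prompt
    formulations (per the prompting mechanism described above)."""
    return max(prompt_scores.values())

# e.g., five prompt formulations of one task, scored with the task's metric
scores = {"p1": 0.61, "p2": 0.58, "p3": 0.65, "p4": 0.60, "p5": 0.63}
best = max_pool_over_prompts(scores)  # 0.65
```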
+
+ ### Metrics and Significance Testing
+ We use the following metrics for the following task types:
+
+ - Fixed-class Classification: average area under the curve (one-vs-all average)
+ - Multichoice Classification: accuracy
+ - Question Answering: exact match
+ - Summarization: ROUGE-RAW (2-gram)
+ - Language Modeling: word-level perplexity
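For the fixed-class classification metric, a rank-based (Mann-Whitney) computation of the one-vs-all average AUC can be sketched like this; it is an illustrative implementation, not necessarily the exact code BCM uses:

```python
def auc_binary(scores, labels):
    """Rank-based (Mann-Whitney) AUC with average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def one_vs_all_avg_auc(class_scores, gold):
    """class_scores: per-example score vectors; gold: gold class indices.
    Average the binary AUC of each class treated as 'one vs. all'."""
    n_classes = len(class_scores[0])
    aucs = []
    for c in range(n_classes):
        scores = [s[c] for s in class_scores]
        labels = [1 if g == c else 0 for g in gold]
        aucs.append(auc_binary(scores, labels))
    return sum(aucs) / n_classes
```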
+
+ On every task, for every metric, we test for statistical significance at α=0.05, i.e., the probability that the performance of model A equals the performance of model B is estimated to be less than 0.05.
+ We use the following tests, with varying statistical power:
+ - accuracy and exact match: one-tailed paired t-test,
+ - average area under the curve: Bayesian test inspired by [(Goutte et al., 2005)](https://link.springer.com/chapter/10.1007/978-3-540-31865-1_25),
+ - summarization & perplexity: bootstrapping.
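For the bootstrapping case, a paired bootstrap test over per-example scores can be sketched as follows (a simplified illustration; the exact BCM procedure may differ):

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate the probability that model A is not better than model B by
    resampling per-example score differences with replacement."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) <= 0:  # A did not beat B on this resample
            not_better += 1
    return not_better / n_resamples

# model A wins the duel at α = 0.05 if the estimated p-value is below 0.05
```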
+
+ ### Duel Scoring Mechanism, Win Score
+ On each task, each model is compared in a duel against every other model (up to the top 50 currently submitted models). For each model, we record the proportion of won duels: the **Win Score** (WS).
+ Next, the **Category Win Score** (CWS) is computed as an average over a model's WSs in that category. Similarly, the 🇨🇿 **BenCzechMark Win Score** is computed as a model's average CWS across categories.
+ The properties of this ranking mechanism include:
+ - The ranking can change after every submission.
+ - The across-task aggregation is interpretable: in words, it measures the average proportion of duels in which the model is better.
+ - It allows utilizing a wide spectrum of existing resources, evaluated under different metrics.
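The aggregation from duels to the overall score can be sketched as follows (the data layout and names are illustrative):

```python
def benczechmark_win_score(duels):
    """duels: {category: {task: {model: [0/1 duel outcomes]}}}.
    WS  = proportion of won duels on a task;
    CWS = mean WS within a category;
    overall score = mean CWS across categories."""
    models = {m for tasks in duels.values() for t in tasks.values() for m in t}
    overall = {}
    for m in models:
        cws = []
        for tasks in duels.values():
            ws = [sum(o[m]) / len(o[m]) for o in tasks.values() if m in o]
            if ws:
                cws.append(sum(ws) / len(ws))
        overall[m] = sum(cws) / len(cws)
    return overall
```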
+
+ ## Baseline Setup
+ The models submitted to the leaderboard by the authors were evaluated in the following setup:
+ - max input length: 2048 tokens
+ - number of shown examples (few-shot mechanism): 3-shot
+ - truncation: smart truncation
+ - log-probability aggregation: average-pooling
+ - chat templates: not used
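The average-pooling of log-probabilities listed above can be sketched as candidate scoring for multichoice answers (names are illustrative, not BCM's API):

```python
def avg_logprob(token_logprobs):
    """Average-pool the per-token log-probabilities of a candidate answer,
    normalizing out length differences between candidates."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_answer(candidates):
    """candidates: {answer_text: [token log-probs under the model]};
    return the answer with the highest average log-probability."""
    return max(candidates, key=lambda a: avg_logprob(candidates[a]))
```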
+
+ ## Citation
+ You can use the following citation for this leaderboard and our upcoming work.
+ ```bibtex
+ @article{fajcik2024benczechmark,
+   title = {{B}en{C}zech{M}ark: A Czech-centric Multitask and Multimetric Benchmark for Language Models with Duel Scoring Mechanism},
+   author = {Martin Fajcik and Martin Docekal and Jan Dolezal and Karel Ondrej and Karel Benes and Jan Kapsa and Michal Hradis and Zuzana Neverilova and Ales Horak and Michal Stefanik and Adam Jirkovsky and David Adamczyk and Jan Hula and Jan Sedivy and Hynek Kydlicek},
+   year = {2024},
+   url = {https://huggingface.co/spaces/CZLC/BenCzechMark},
+   institution = {Brno University of Technology, Masaryk University, Czech Technical University in Prague, Hugging Face},
+ }
+ ```
+
+
+ ## Authors & Correspondence
  - **BenCzechMark Authors & Contributors:**
    - **BUT FIT**
      - Martin Fajčík