"""
This file contains the text content for the leaderboard client.
"""
HEADER_MARKDOWN = """
# 🇨🇿 BenCzechMark

Welcome to the leaderboard!  
Here you can compare models on Czech-language tasks and/or submit your own model. We use our modified fork of [lm-evaluation-harness](https://github.com/DCGM/lm-evaluation-harness) to evaluate every model under the same protocol.

- Head to the **Submission** page to learn about submission details.
- See the **About** page for a brief description of our evaluation protocol and win-score mechanism, citation information, and future directions for this benchmark.
- On the Submission page, __you can obtain results on the leaderboard without publishing them__.
    - The first step is a "pre-submission"; once it is done (significance tests can take up to an hour), you can submit the results if you'd like to.

"""
LEADERBOARD_TAB_TITLE_MARKDOWN = """
   """

SUBMISSION_TAB_TITLE_MARKDOWN = """
    ## How to submit
    1. Head over to our modified fork of [lm-evaluation-harness](https://github.com/DCGM/lm-evaluation-harness).
    Follow the instructions and evaluate your model on all 🇨🇿 BenCzechMark tasks, logging your lm harness outputs into a designated folder.

    2. Use the script from the [benczechmark-leaderboard](https://github.com/MFajcik/benczechmark-leaderboard) repository to process the log files from your designated folder into a single compact submission file that contains everything we need.  
    Example usage:
   - Download sample outputs for csmpt7b from [csmpt_logdir.zip](https://czechllm.fit.vutbr.cz/csmpt7b/sample_results/csmpt_logdir.zip).
   - Unzip the archive.
   - Run the script from the leaderboard repository with Python (requires the jsonlines and tqdm libraries):
   ```bash 
   git clone https://github.com/MFajcik/benczechmark-leaderboard.git
   cd benczechmark-leaderboard/
   export PYTHONPATH=$(pwd)
   python leaderboard/compile_log_files.py \
   -i "<your_local_path_to_folder>/csmpt_logdir/csmpt/eval_csmpt7b*" \
   -o "<your_local_path_to_outfolder>/sample_submission.json"
   ```

    3. Upload your file and fill in the form below!
    
    ## Submission
    To submit your model, please fill in the form below.
    
    - *Team name:* The name of your team, as it will appear on the leaderboard
    - *Model name:* The name of your model
    - *Model type:* The type of your model (chat, pretrained, ensemble)
    - *Parameters (B):* The number of parameters of your model in billions (10⁹)
    - *Input length (# tokens):* The maximum number of input tokens used to obtain the results
    - *Precision:* The precision with which the results were obtained
    - *Description:* A short description of your submission (optional)
    - *Link to model:* Link to the model's repository or documentation
    - *Upload your results:* The results JSON file to submit
    
    After filling in the form, click the **Pre-submit model** button. 
    This will run a comparison of your model against the existing leaderboard models. 
    After the tournament is complete, you will be able to submit your model to the leaderboard.
"""

RANKING_AFTER_SUBMISSION_MARKDOWN = """
                    This is how the ranking will look after your submission:
                    """
SUBMISSION_DETAILS_MARKDOWN = """
                    Do you really want to submit a model? This action is irreversible.
                    """
MORE_DETAILS_MARKDOWN = """
Here you can view how the selected model won or lost duels against every other model in the selected 🇨🇿 BenCzechMark category.
"""

MODAL_SUBMIT_MARKDOWN = """
Are you sure you want to submit your model?
"""

ABOUT_MARKDOWN = """
## Abstract
We present **B**en**C**zech**M**ark (BCM), the first multitask and multimetric Czech language benchmark for large language models, with a unique scoring system that utilizes the theory of statistical significance. Our benchmark covers 54 challenging, mostly native Czech tasks spanning 11 categories, including diverse domains such as historical Czech, pupil and language-learner essays, and spoken word.

Furthermore, we collect and clean the [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/BUT-LCC), the largest publicly available clean Czech language corpus, and continuously pretrain the first Czech-centric 7B language model [CSMPT7B](https://huggingface.co/BUT-FIT/csmpt7b), with Czech-specific tokenization. We use our model as a baseline for comparison with publicly available multilingual models.

## Methodology
While we will reveal more details in our upcoming work, here is how the leaderboard ranking works in a nutshell.  

### Prompting Mechanism
Each task (except for tasks from the language modelling category) is composed of 5 or more prompts. The performance of every model is then max-pooled over these prompts (the best performance counts), as illustrated below.
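
As a minimal sketch (the prompt names and scores below are hypothetical, not actual benchmark numbers):

```python
# Minimal sketch of prompt max-pooling; scores are hypothetical.
# Each task ships with 5+ prompt templates; a model's task score is
# the best score it achieved over those prompts.
prompt_scores = {"prompt_0": 0.61, "prompt_1": 0.58, "prompt_2": 0.64}
task_score = max(prompt_scores.values())  # 0.64 -- the best prompt counts
```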

### Metrics and Significance Testing
We use the following metrics for the following task types:

- Fixed-class Classification: average area under the curve (one-vs-all average)
- Multichoice Classification: accuracy
- Question Answering: exact match
- Summarization: ROUGE-RAW (2-gram)
- Language Modeling: word-level perplexity

On every task and for every metric, we run a test for statistical significance at α = 0.05; i.e., model A is deemed to beat model B only if the probability that their performances are equal is estimated to be less than 0.05.
We use the following tests, with varying statistical power (see the sketch after this list):
- accuracy and exact match: one-tailed paired t-test,
- average area under the curve: a Bayesian test inspired by [Goutte et al., 2005](https://link.springer.com/chapter/10.1007/978-3-540-31865-1_25),
- summarization & perplexity: bootstrapping.
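
For the accuracy/exact-match case, a minimal sketch of one such duel, assuming per-example 0/1 correctness vectors for both models (the data below is hypothetical; the exact procedure is described in our upcoming work):

```python
# Sketch: one-tailed paired t-test on per-example 0/1 correctness
# of two models; the data is hypothetical.
import numpy as np
from scipy import stats

model_a = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # model A, correct per example
model_b = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 1])  # model B on the same examples

# H1: model A performs better than model B (one-tailed).
result = stats.ttest_rel(model_a, model_b, alternative="greater")
a_beats_b = result.pvalue < 0.05  # significant win at alpha = 0.05
```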

### Duel Scoring Mechanism, Win Score
On each task, each model duels every other model (up to the top 50 currently submitted models). For each model, we record the proportion of duels it won: the **Win Score** (WS).
Next, the **Category Win Score** (CWS) is computed as the average of a model's WSs over the tasks in that category. Similarly, the 🇨🇿 **BenCzechMark Win Score** is computed as the model's average CWS across categories; see the sketch after this list. 
The properties of this ranking mechanism include:
- The ranking can change after every submission.
- The across-task aggregation is interpretable: in words, it measures the average proportion of duels in which the model is better.
- It allows utilizing a wide spectrum of existing resources evaluated under different metrics.
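
A minimal sketch of this aggregation for a single model (task names, category names, and duel outcomes are hypothetical):

```python
# Sketch of win-score aggregation for one model; all data is hypothetical.
from statistics import mean

# Outcomes of significance duels against other models: 1 = won, 0 = lost.
duels = {
    "task_sentiment": [1, 1, 0, 1],
    "task_topic":     [1, 0, 0, 1],
    "task_squad_cs":  [0, 1, 1, 1],
}
categories = {
    "classification": ["task_sentiment", "task_topic"],
    "question_answering": ["task_squad_cs"],
}

# Win Score (WS): proportion of duels won on a task.
ws = {task: mean(outcomes) for task, outcomes in duels.items()}
# Category Win Score (CWS): average WS over the tasks in a category.
cws = {cat: mean(ws[t] for t in tasks) for cat, tasks in categories.items()}
# BenCzechMark Win Score: average CWS across categories.
bcm_win_score = mean(cws.values())
```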

## Baseline Setup
The models submitted to the leaderboard by the authors were evaluated in the following setup (summarized as a mapping below):
- max input length: 2048 tokens
- number of shown examples (few-shot mechanism): 3-shot
- truncation: smart truncation
- log-probability aggregation: average-pooling
- chat templates: not used
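
For reference, the same setup expressed as a plain Python mapping; the key names are illustrative only and do not correspond to actual lm-evaluation-harness arguments:

```python
# Baseline evaluation setup; key names are illustrative only and do not
# correspond to actual lm-evaluation-harness arguments.
BASELINE_SETUP = {
    "max_input_length": 2048,                 # tokens
    "num_fewshot": 3,                         # examples shown per prompt
    "truncation": "smart",
    "logprob_aggregation": "average-pooling",
    "use_chat_template": False,               # chat templates not used
}
```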

## Citation
You can use the following citation for this leaderboard and our upcoming work.
```bibtex
@article{fajcik2024benczechmark,
  title = {{B}en{C}zech{M}ark: A Czech-centric Multitask and Multimetric Benchmark for Language Models with Duel Scoring Mechanism},
  author = {Martin Fajcik and Martin Docekal and Jan Dolezal and Karel Ondrej and Karel Benes and Jan Kapsa and Michal Hradis and Zuzana Neverilova and Ales Horak and Michal Stefanik and Adam Jirkovsky and David Adamczyk and Jan Hula and Jan Sedivy and Hynek Kydlicek},
  year = {2024},
  url = {https://huggingface.co/spaces/CZLC/BenCzechMark},
  institution = {Brno University of Technology, Masaryk University, Czech Technical University in Prague, Hugging Face},
}
```


## Authors & Correspondence
- **BenCzechMark Authors & Contributors:**
  - **BUT FIT**
    - Martin Fajčík
    - Martin Dočekal
    - Jan Doležal
    - Karel Ondřej
    - Karel Beneš
    - Jan Kapsa
    - Michal Hradiš
  - **FI MUNI**
    - Zuzana Nevěřilová
    - Aleš Horák
    - Michal Štefánik
  - **CIIRC CTU**
    - Adam Jirkovský
    - David Adamczyk
    - Jan Hůla
    - Jan Šedivý
  - **Hugging Face**
    - Hynek Kydlíček

- **Leaderboard Authors & Contributors:**
    - Jan Doležal - Coding and Troubleshooting
    - Martin Fajčík - Management & Debugging
    - Alexander Polok, Jakub Štetina - Leaderboard Version 0.1


**Correspondence to:**
- Martin Fajčík
- Brno University of Technology, Brno, Czech Republic
- Email: [[email protected]](mailto:[email protected])
"""