Update content.py

content.py (+1, -1)
@@ -88,7 +88,7 @@ We use the following metrics for the following tasks:
 On every task, for every metric, we compute a test for statistical significance at α=0.05, i.e., the probability that the performance of model A is equal to the performance of model B is estimated to be less than 0.05.
 We use the following tests, with varying statistical power:
 - accuracy and exact-match: one-tailed paired t-test,
-- average area under the curve: Bayesian test inspired by
+- average area under the curve: Bayesian test inspired by [Goutte et al., 2005](https://link.springer.com/chapter/10.1007/978-3-540-31865-1_25),
 - summarization & perplexity: bootstrapping.
 
 ### Duel Scoring Mechanism, Win Score
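The hunk above only names the tests, so as context for reviewers, here is a minimal Python sketch of two of them, the one-tailed paired t-test (accuracy and exact-match) and the bootstrap (summarization and perplexity), assuming per-example scores for both models are available as aligned NumPy arrays. The function names, resampling count, and seed are illustrative assumptions, not code from content.py.

```python
# Sketch of the per-task significance checks at alpha = 0.05.
# Assumes scores_a[i] and scores_b[i] score the same test example.
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level stated in the section


def t_test_wins(scores_a: np.ndarray, scores_b: np.ndarray) -> bool:
    """One-tailed paired t-test: does model A score significantly
    higher than model B on the same examples?"""
    # alternative="greater" makes the test one-tailed in favour of A.
    result = stats.ttest_rel(scores_a, scores_b, alternative="greater")
    return result.pvalue < ALPHA


def bootstrap_wins(scores_a: np.ndarray, scores_b: np.ndarray,
                   n_resamples: int = 10_000, seed: int = 0) -> bool:
    """Bootstrap test: resample the paired score differences and
    estimate how often the mean difference fails to favour A."""
    rng = np.random.default_rng(seed)
    diffs = scores_a - scores_b
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    # Estimated p-value: fraction of resamples where A does not beat B.
    return np.mean(boot_means <= 0) < ALPHA
```

Pairing matters in both tests: working with per-example differences removes the variance contributed by the examples themselves, which is what gives these tests their power on a fixed test set.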
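For the middle bullet, the cited Goutte et al., 2005 paper compares two systems by placing Gamma posteriors over confusion counts and sampling the induced metric distributions. The sketch below applies that recipe to F1 as in the paper; how content.py adapts it to average area under the curve is not visible in this hunk, so the counts, the Jeffreys-style prior, and the function names are all assumptions.

```python
# Sketch of a Bayesian comparison in the spirit of Goutte et al., 2005:
# Gamma variates over TP/FP/FN counts induce a posterior over F1, and
# model A "wins" if P(F1_A <= F1_B) is estimated below alpha.
import numpy as np


def goutte_f1_samples(tp: int, fp: int, fn: int,
                      n_samples: int = 100_000, rng=None) -> np.ndarray:
    """Sample from the posterior of F1 given confusion counts,
    using Gamma(count + 1/2, 1) variates (Jeffreys-style prior)."""
    if rng is None:
        rng = np.random.default_rng(0)
    l_tp = rng.gamma(tp + 0.5, 1.0, n_samples)
    l_fp = rng.gamma(fp + 0.5, 1.0, n_samples)
    l_fn = rng.gamma(fn + 0.5, 1.0, n_samples)
    precision = l_tp / (l_tp + l_fp)
    recall = l_tp / (l_tp + l_fn)
    return 2 * precision * recall / (precision + recall)


def bayes_wins(counts_a, counts_b, alpha: float = 0.05) -> bool:
    """counts_* are (TP, FP, FN) tuples for the two models."""
    rng = np.random.default_rng(0)
    f1_a = goutte_f1_samples(*counts_a, rng=rng)
    f1_b = goutte_f1_samples(*counts_b, rng=rng)
    return np.mean(f1_a <= f1_b) < alpha
```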