Update content.py

content.py (+1, -1)
@@ -88,7 +88,7 @@ We use the following metrics for the following tasks:
 On every task, for every metric, we compute a test for statistical significance at α=0.05, i.e., the probability that the performance of model A is equal to the performance of model B is estimated to be less than 0.05.
 We use the following tests, with varying statistical power:
 - accuracy and exact-match: one-tailed paired t-test,
-- average area under the curve: Bayesian test inspired by
+- average area under the curve: Bayesian test inspired by [Goutte et al., 2005](https://link.springer.com/chapter/10.1007/978-3-540-31865-1_25),
 - summarization & perplexity: bootstrapping.
 
 ### Duel Scoring Mechanism, Win Score
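The hunk above only names the tests, so as context for reviewers, here is a minimal Python sketch of two of them, the one-tailed paired t-test (accuracy and exact-match) and the bootstrap (summarization and perplexity), assuming per-example scores for both models are available as aligned NumPy arrays. The function names, resampling count, and seed are illustrative assumptions, not code from content.py.

```python
# Sketch of the per-task significance checks at alpha = 0.05.
# Assumes scores_a[i] and scores_b[i] score the same test example.
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level stated in the section


def t_test_wins(scores_a: np.ndarray, scores_b: np.ndarray) -> bool:
    """One-tailed paired t-test: does model A score significantly
    higher than model B on the same examples?"""
    # alternative="greater" makes the test one-tailed in favour of A.
    result = stats.ttest_rel(scores_a, scores_b, alternative="greater")
    return result.pvalue < ALPHA


def bootstrap_wins(scores_a: np.ndarray, scores_b: np.ndarray,
                   n_resamples: int = 10_000, seed: int = 0) -> bool:
    """Bootstrap test: resample the paired score differences and
    estimate how often the mean difference fails to favour A."""
    rng = np.random.default_rng(seed)
    diffs = scores_a - scores_b
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    # Estimated p-value: fraction of resamples where A does not beat B.
    return np.mean(boot_means <= 0) < ALPHA
```

Pairing matters in both tests: working with per-example differences removes the variance contributed by the examples themselves, which is what gives these tests their power on a fixed test set.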
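For the middle bullet, the cited Goutte et al., 2005 paper compares two systems by placing Gamma posteriors over confusion counts and sampling the induced metric distributions. The sketch below applies that recipe to F1 as in the paper; how content.py adapts it to average area under the curve is not visible in this hunk, so the counts, the Jeffreys-style prior, and the function names are all assumptions.

```python
# Sketch of a Bayesian comparison in the spirit of Goutte et al., 2005:
# Gamma variates over TP/FP/FN counts induce a posterior over F1, and
# model A "wins" if P(F1_A <= F1_B) is estimated below alpha.
import numpy as np


def goutte_f1_samples(tp: int, fp: int, fn: int,
                      n_samples: int = 100_000, rng=None) -> np.ndarray:
    """Sample from the posterior of F1 given confusion counts,
    using Gamma(count + 1/2, 1) variates (Jeffreys-style prior)."""
    if rng is None:
        rng = np.random.default_rng(0)
    l_tp = rng.gamma(tp + 0.5, 1.0, n_samples)
    l_fp = rng.gamma(fp + 0.5, 1.0, n_samples)
    l_fn = rng.gamma(fn + 0.5, 1.0, n_samples)
    precision = l_tp / (l_tp + l_fp)
    recall = l_tp / (l_tp + l_fn)
    return 2 * precision * recall / (precision + recall)


def bayes_wins(counts_a, counts_b, alpha: float = 0.05) -> bool:
    """counts_* are (TP, FP, FN) tuples for the two models."""
    rng = np.random.default_rng(0)
    f1_a = goutte_f1_samples(*counts_a, rng=rng)
    f1_b = goutte_f1_samples(*counts_b, rng=rng)
    return np.mean(f1_a <= f1_b) < alpha
```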