Update content.py
content.py CHANGED (+1 -1)
@@ -95,7 +95,7 @@ We use the following tests, with varying statistical power:
 
 ### Duel Scoring Mechanism, Win Score
 On each task, each model is scored to each model (up to top-50 currently submitted models). For each model, record proportion of won duels: **Win Score**(WS).
-Next, the
+Next, the **Category Win Score**(CWS), is computed as an average over model's WSs in that category. Similarly, 🇨🇿 **BenCzechMark Win Score** is computed as model's average CWS across categories.
 The properties of this ranking mechanism include:
 - Ranking can change after every submission.
 - The across-task aggregation is interpretable: in words, it measures the average proportion of times the model is better.
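The aggregation described in the added line can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual code: the function names, the input layout (`duels[task][model]` as a list of won/lost flags), and the task-to-category mapping are all assumptions made for the example, and how an individual duel is decided (the statistical tests mentioned in the hunk header) is taken as given.

```python
from collections import defaultdict
from statistics import mean


def win_scores(duels):
    """WS: per task, the proportion of duels each model won.

    `duels[task][model]` is a hypothetical layout: a list of booleans,
    one per duel that model fought on that task (True = won).
    """
    return {
        task: {model: sum(results) / len(results) for model, results in per_model.items()}
        for task, per_model in duels.items()
    }


def category_win_scores(ws, task_category):
    """CWS: average of a model's WS over the tasks in each category."""
    grouped = defaultdict(lambda: defaultdict(list))
    for task, per_model in ws.items():
        for model, score in per_model.items():
            grouped[task_category[task]][model].append(score)
    return {
        category: {model: mean(scores) for model, scores in per_model.items()}
        for category, per_model in grouped.items()
    }


def overall_win_score(cws):
    """Overall score: average of a model's CWS across categories."""
    per_model = defaultdict(list)
    for scores in cws.values():
        for model, score in scores.items():
            per_model[model].append(score)
    return {model: mean(scores) for model, scores in per_model.items()}


# Made-up duel outcomes for two models on three tasks in two categories.
duels = {
    "task_a": {"m1": [True, True], "m2": [False, True]},
    "task_b": {"m1": [True, False], "m2": [False, False]},
    "task_c": {"m1": [False, False], "m2": [True, True]},
}
task_category = {"task_a": "cat_1", "task_b": "cat_1", "task_c": "cat_2"}

ws = win_scores(duels)
cws = category_win_scores(ws, task_category)
overall = overall_win_score(cws)
```

In this sketch the overall score averages over categories rather than over tasks, so a category with a single task carries the same weight as one with many; that follows the wording of the added line, but the real implementation may differ in details.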