Update README.md
README.md
CHANGED
@@ -84,18 +84,6 @@ Details are in the paper’s Appendix.
## Evaluation

The model was evaluated using the setup described in the SwallowCode paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include code generation (HumanEval, HumanEval+) and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, GSM8K, BBH). Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.

- Evaluation Results (Experiment 2)
-
- | Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO  | MMLU   | GSM8K  | BBH    | HumanEval | HumanEval+ |
- |------------|------------|----------|-----------|----------|--------|--------|--------|--------|-----------|------------|
- | 10         | 0.3560     | 0.6675   | 0.6015    | 0.3385   | 0.9062 | 0.6321 | 0.4784 | 0.5881 | 0.3604    | 0.3713     |
- | 20         | 0.3520     | 0.6635   | 0.6026    | 0.3364   | 0.9049 | 0.6252 | 0.4784 | 0.5781 | 0.3591    | 0.3585     |
- | 30         | 0.3560     | 0.6637   | 0.6012    | 0.3375   | 0.9080 | 0.6313 | 0.5019 | 0.5950 | 0.3701    | 0.3762     |
- | 40         | 0.3580     | 0.6679   | 0.6046    | 0.3346   | 0.9062 | 0.6330 | 0.5019 | 0.5998 | 0.3720    | 0.3689     |
- | 50         | 0.3660     | 0.6694   | 0.6055    | 0.3340   | 0.9084 | 0.6325 | 0.5155 | 0.6044 | 0.3787    | 0.3787     |
-
- *Source: Table 3 from the SwallowCode paper, showing performance of the syntax-error-free Python subset.*
-

## Citation

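For context on the harness named in the Evaluation paragraph above, the sketch below shows one way to run a comparable subset of the general-task benchmarks with lm-evaluation-harness from Python. The model id, task list, and batch size are placeholder assumptions, not the configuration used in the SwallowCode paper; the code-generation benchmarks (HumanEval, HumanEval+, BigCodeBench) use their own tooling and are not covered here.

```python
# Minimal sketch, not the paper's exact setup: evaluate a checkpoint on a few of
# the general tasks listed above using lm-evaluation-harness (v0.4+).
# "org/model-name", the task subset, and batch_size are placeholder assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                              # Hugging Face transformers backend
    model_args="pretrained=org/model-name",  # placeholder checkpoint id
    tasks=["openbookqa", "hellaswag", "triviaqa", "mmlu", "gsm8k"],
    batch_size=8,
)

# Per-task metric dictionaries (accuracy, exact match, ...), keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```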