Update README.md
README.md
CHANGED
@@ -86,20 +86,6 @@ Details are in the paper’s Appendix.
 ## Evaluation
 The model was evaluated using the setup described in the SwallowCode paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include code generation (HumanEval, HumanEval+) and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, GSM8K, BBH). Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
 
-Evaluation Results (Experiment 3)
-
-### Evaluation Results (Experiment 3)
-
-| Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU | GSM8K | BBH | HumanEval | HumanEval+ |
-|------------|------------|----------|-----------|----------|-------|--------|--------|--------|-----------|------------|
-| 10 | 0.3560 | 0.6628 | 0.6010 | 0.3340 | 0.9071| 0.6235 | 0.4564 | 0.6007 | 0.3500 | 0.3488 |
-| 20 | 0.3500 | 0.6613 | 0.6015 | 0.3361 | 0.9054| 0.6237 | 0.4860 | 0.5838 | 0.3744 | 0.3787 |
-| 30 | 0.3620 | 0.6596 | 0.6008 | 0.3359 | 0.9080| 0.6307 | 0.4867 | 0.5921 | 0.3957 | 0.3878 |
-| 40 | 0.3720 | 0.6650 | 0.6030 | 0.3352 | 0.9058| 0.6326 | 0.4822 | 0.5990 | 0.3890 | 0.3915 |
-| 50 | 0.3740 | 0.6677 | 0.6054 | 0.3291 | 0.9019| 0.6327 | 0.4996 | 0.6145 | 0.3945 | 0.3902 |
-
-*Source: Table 4 from the SwallowCode paper, showing performance of the syntax-error and Pylint-filtered (score ≥ 7) Python subset.*
-
 
 ## Citation
 