Update README.md
README.md CHANGED
@@ -80,16 +80,6 @@ Details are in the paper’s Appendix.
 ## Evaluation
 
 The model was evaluated using the setup described in the SwallowCode paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include code generation (HumanEval, HumanEval+) and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, GSM8K, BBH). Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
 
-Evaluation Results (Experiment 1)
-
-| Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD 2.0 | XWINO | MMLU | GSM8K | BBH | HumanEval | HumanEval+ |
-|---|---|---|---|---|---|---|---|---|---|---|
-| 10 | 0.3640 | 0.6659 | 0.5995 | 0.3354 | 0.9032 | 0.6294 | 0.4602 | 0.6019 | 0.3366 | 0.3366 |
-| 20 | 0.3540 | 0.6567 | 0.6019 | 0.3360 | 0.9024 | 0.6238 | 0.4852 | 0.5898 | 0.3433 | 0.3433 |
-| 30 | 0.3700 | 0.6588 | 0.6034 | 0.3377 | 0.9045 | 0.6263 | 0.5072 | 0.5939 | 0.3402 | 0.3421 |
-| 40 | 0.3800 | 0.6618 | 0.6053 | 0.3380 | 0.9097 | 0.6341 | 0.5011 | 0.6016 | 0.3659 | 0.3701 |
-| 50 | 0.3700 | 0.6679 | 0.6054 | 0.3350 | 0.9045 | 0.6340 | 0.5027 | 0.6091 | 0.3689 | 0.3720 |
-
 ## Citation
 
 ```bibtex
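
For reference, the general-task portion of the evaluation setup mentioned in the diff above could be launched with the lm-evaluation-harness Python API along the lines of the sketch below. This is a minimal illustration assuming the v0.4+ `lm_eval.simple_evaluate` interface; the checkpoint path, task list, and batch size are placeholder assumptions rather than values from this repository, and the code-generation benchmarks (HumanEval, HumanEval+) would be run separately through BigCodeBench.

```python
# Minimal sketch (assumptions): evaluate one intermediate checkpoint on a few of
# the listed general tasks using lm-evaluation-harness (v0.4+ Python API).
import lm_eval

MODEL_PATH = "path/to/10B-token-checkpoint"  # placeholder path, not from this repo

results = lm_eval.simple_evaluate(
    model="hf",                                  # Hugging Face transformers backend
    model_args=f"pretrained={MODEL_PATH},dtype=bfloat16",
    tasks=["openbookqa", "hellaswag", "gsm8k"],  # illustrative subset of the listed tasks
    batch_size=8,
)

# Per-task metrics (accuracy, exact match, ...) are collected under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```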