tokyotech-llm/Llama-3.1-8B-code-ablation-exp3-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0012500

kazukifujii committed on Jul 4

Commit a0980e8 (verified) · 1 Parent(s): 9897b2c

Update README.md

Files changed (1)

README.md +0 -14

@@ -86,20 +86,6 @@ Details are in the paper’s Appendix.
 ## Evaluation
 The model was evaluated using the setup described in the SwallowCode paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include code generation (HumanEval, HumanEval+) and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, GSM8K, BBH). Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
-Evaluation Results (Experiment 3)
-### Evaluation Results (Experiment 3)
-| Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU   | GSM8K  | BBH    | HumanEval | HumanEval+ |
-|------------|------------|----------|-----------|----------|-------|--------|--------|--------|-----------|------------|
-| 10         | 0.3560     | 0.6628   | 0.6010    | 0.3340   | 0.9071| 0.6235 | 0.4564 | 0.6007 | 0.3500    | 0.3488     |
-| 20         | 0.3500     | 0.6613   | 0.6015    | 0.3361   | 0.9054| 0.6237 | 0.4860 | 0.5838 | 0.3744    | 0.3787     |
-| 30         | 0.3620     | 0.6596   | 0.6008    | 0.3359   | 0.9080| 0.6307 | 0.4867 | 0.5921 | 0.3957    | 0.3878     |
-| 40         | 0.3720     | 0.6650   | 0.6030    | 0.3352   | 0.9058| 0.6326 | 0.4822 | 0.5990 | 0.3890    | 0.3915     |
-| 50         | 0.3740     | 0.6677   | 0.6054    | 0.3291   | 0.9019| 0.6327 | 0.4996 | 0.6145 | 0.3945    | 0.3902     |
-*Source: Table 4 from the SwallowCode paper, showing performance of the syntax-error and Pylint-filtered (score ≥ 7) Python subset.*
 ## Citation
