Update README.md
README.md
CHANGED
@@ -84,18 +84,6 @@ Details are in the paper’s Appendix.
## Evaluation

The model was evaluated using the setup described in the SwallowCode paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include code generation (HumanEval, HumanEval+) and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, GSM8K, BBH). Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.

- Evaluation Results (Experiment 2)
-
- | Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO  | MMLU   | GSM8K  | BBH    | HumanEval | HumanEval+ |
- |------------|------------|----------|-----------|----------|--------|--------|--------|--------|-----------|------------|
- | 10         | 0.3560     | 0.6675   | 0.6015    | 0.3385   | 0.9062 | 0.6321 | 0.4784 | 0.5881 | 0.3604    | 0.3713     |
- | 20         | 0.3520     | 0.6635   | 0.6026    | 0.3364   | 0.9049 | 0.6252 | 0.4784 | 0.5781 | 0.3591    | 0.3585     |
- | 30         | 0.3560     | 0.6637   | 0.6012    | 0.3375   | 0.9080 | 0.6313 | 0.5019 | 0.5950 | 0.3701    | 0.3762     |
- | 40         | 0.3580     | 0.6679   | 0.6046    | 0.3346   | 0.9062 | 0.6330 | 0.5019 | 0.5998 | 0.3720    | 0.3689     |
- | 50         | 0.3660     | 0.6694   | 0.6055    | 0.3340   | 0.9084 | 0.6325 | 0.5155 | 0.6044 | 0.3787    | 0.3787     |
-
- *Source: Table 3 from the SwallowCode paper, showing performance of the syntax-error-free Python subset.*
-

## Citation

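For context on the harness named in the Evaluation paragraph above, the sketch below shows one way to run a comparable subset of the general-task benchmarks with lm-evaluation-harness from Python. The model id, task list, and batch size are placeholder assumptions, not the configuration used in the SwallowCode paper; the code-generation benchmarks (HumanEval, HumanEval+, BigCodeBench) use their own tooling and are not covered here.

```python
# Minimal sketch, not the paper's exact setup: evaluate a checkpoint on a few of
# the general tasks listed above using lm-evaluation-harness (v0.4+).
# "org/model-name", the task subset, and batch_size are placeholder assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                              # Hugging Face transformers backend
    model_args="pretrained=org/model-name",  # placeholder checkpoint id
    tasks=["openbookqa", "hellaswag", "triviaqa", "mmlu", "gsm8k"],
    batch_size=8,
)

# Per-task metric dictionaries (accuracy, exact match, ...), keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```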