The released final ckpt and stage2-ingredient1-step23852-tokens51B ckpt have different eval results
As mentioned in #1, the released final checkpoint corresponds to ingredient 1, stage2-ingredient1-step23852-tokens51B. I use lm-evaluation-harness to evaluate allenai/OLMo-2-0425-1B and stage2-ingredient1-step23852-tokens51B, and they have different results on MMLU and gsm8k. Can you please clarify why the released ckpt has lower evaluation results? Thanks.
MMLU:
released final:
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.4257|± |0.0041|
| - humanities | 2|none | |acc |↑ |0.3947|± |0.0069|
| - other | 2|none | |acc |↑ |0.4870|± |0.0088|
| - social sciences| 2|none | |acc |↑ |0.4807|± |0.0089|
| - stem | 2|none | |acc |↑ |0.3578|± |0.0084|
stage2-ingredient1-step23852-tokens51B:
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.4417|± |0.0041|
| - humanities | 2|none | |acc |↑ |0.4136|± |0.0069|
| - other | 2|none | |acc |↑ |0.4957|± |0.0088|
| - social sciences| 2|none | |acc |↑ |0.5018|± |0.0088|
| - stem | 2|none | |acc |↑ |0.3717|± |0.0085|
gsm8k:
released final:
hf (pretrained=allenai/OLMo-2-0425-1B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 4|exact_match|↑ |0.4079|± |0.0135|
| | |strict-match | 4|exact_match|↑ |0.4003|± |0.0135|
stage2-ingredient1-step23852-tokens51B:
hf (pretrained=allenai/OLMo-2-0425-1B,revision=stage2-ingredient1-step23852-tokens51B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 4|exact_match|↑ |0.4594|± |0.0137|
| | |strict-match | 4|exact_match|↑ |0.4223|± |0.0136|
I use the same evaluation setting:
```
lm_eval --model hf \
    --model_args pretrained=allenai/OLMo-2-0425-1B(,revision=stage2-ingredient1-step23852-tokens51B) \
    --tasks gsm8k_cot \
    --batch_size auto \
    --num_fewshot 4 \
    --trust_remote_code \
    --confirm_run_unsafe_code
```
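The only thing that changes between the two runs is the `revision` argument. For reference, this is roughly the same setting through lm-evaluation-harness's Python API (a minimal sketch on my side, assuming a recent v0.4+ harness, not the exact script I ran):

```python
# Sketch: same gsm8k_cot setting as the CLI command above, via lm_eval's Python API.
# The revision string is the only thing that differs between the two runs.
import lm_eval

for revision in [None, "stage2-ingredient1-step23852-tokens51B"]:
    model_args = "pretrained=allenai/OLMo-2-0425-1B,trust_remote_code=True"
    if revision is not None:
        model_args += f",revision={revision}"

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=model_args,
        tasks=["gsm8k_cot"],
        num_fewshot=4,
        batch_size="auto",
    )
    print(revision or "main", results["results"]["gsm8k_cot"])
```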
Also, the description in allenai/OLMo claims that the released main ckpt is merged from a soup, which differs from the description on the HF model page and in #1.
Hey @wydwww, thanks for raising this issue. I have cross-verified this with the team again.
- There is no model souping (there was a typo in the README file on the GitHub OLMo repo, which I have fixed).
- My comment in #1 was wrong: ingredient 3 (seed 42) is the final main checkpoint, not ingredients 1 and 2, which are just exploratory anneals. I have corrected this in #1.
- To clear things up, I have also updated the README.
Sorry for the inconvenience. You can retry the evals.
Thanks for your reply @amanrangapur. I ran the gsm8k eval of stage2-ingredient3-step23852-tokens51B with the same command, and still got a significantly higher result (0.4549) than the main ckpt (0.4079). FYI, the ingredient 2 ckpt has a 0.4556 score in this setting. Did you use any post-processing to get the final ckpt?
hf (pretrained=allenai/OLMo-2-0425-1B,revision=stage2-ingredient3-step23852-tokens51B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 4|exact_match|↑ |0.4549|± |0.0137|
| | |strict-match | 4|exact_match|↑ |0.4511|± |0.0137|
Hey @wydwww, we did not use any post-processing on the final checkpoint. We selected one of the ingredients (anneals) based on its average score across evals.
@amanrangapur It seems that the final ckpt does not match any of the three ingredient ckpts. Do you have any thoughts on this? Can you please verify that the main and stage2-ingredient3-step23852-tokens51B ckpts are the same in your setting? Thanks.
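For reference, this is the kind of check I have in mind (a minimal sketch, assuming both revisions load with transformers' AutoModelForCausalLM; not an official script):

```python
# Sketch: check whether the main branch and the ingredient 3 revision resolve to
# the same commit and contain identical weights.
import torch
from huggingface_hub import HfApi
from transformers import AutoModelForCausalLM

REPO = "allenai/OLMo-2-0425-1B"
REV = "stage2-ingredient3-step23852-tokens51B"

# 1) Compare the commit hashes the two revisions point to.
api = HfApi()
for rev in ["main", REV]:
    print(rev, api.model_info(REPO, revision=rev).sha)

# 2) Compare the actual tensors.
main_model = AutoModelForCausalLM.from_pretrained(REPO, trust_remote_code=True)
ing3_model = AutoModelForCausalLM.from_pretrained(REPO, revision=REV, trust_remote_code=True)

sd_main, sd_ing3 = main_model.state_dict(), ing3_model.state_dict()
mismatched = [k for k in sd_main if not torch.equal(sd_main[k], sd_ing3[k])]
print("identical weights" if not mismatched
      else f"{len(mismatched)} tensors differ, e.g. {mismatched[:5]}")
```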