The released final ckpt and stage2-ingredient1-step23852-tokens51B ckpt have different eval results

#2
by wydwww - opened

As mentioned in #1, the released final checkpoint corresponds to ingredient 1, stage2-ingredient1-step23852-tokens51B. I used lm-evaluation-harness to evaluate allenai/OLMo-2-0425-1B and stage2-ingredient1-step23852-tokens51B, and the two give different results on MMLU and gsm8k.

Can you please clarify why the released ckpt has lower evaluation results? Thanks.

MMLU:
released final:

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.4257|±  |0.0041|
| - humanities     |      2|none  |      |acc   |↑  |0.3947|±  |0.0069|
| - other          |      2|none  |      |acc   |↑  |0.4870|±  |0.0088|
| - social sciences|      2|none  |      |acc   |↑  |0.4807|±  |0.0089|
| - stem           |      2|none  |      |acc   |↑  |0.3578|±  |0.0084|

stage2-ingredient1-step23852-tokens51B:

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.4417|±  |0.0041|
| - humanities     |      2|none  |      |acc   |↑  |0.4136|±  |0.0069|
| - other          |      2|none  |      |acc   |↑  |0.4957|±  |0.0088|
| - social sciences|      2|none  |      |acc   |↑  |0.5018|±  |0.0088|
| - stem           |      2|none  |      |acc   |↑  |0.3717|±  |0.0085|

gsm8k:
released final:

hf (pretrained=allenai/OLMo-2-0425-1B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     4|exact_match|↑  |0.4079|±  |0.0135|
|         |       |strict-match    |     4|exact_match|↑  |0.4003|±  |0.0135|

stage2-ingredient1-step23852-tokens51B:

hf (pretrained=allenai/OLMo-2-0425-1B,revision=stage2-ingredient1-step23852-tokens51B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     4|exact_match|↑  |0.4594|±  |0.0137|
|         |       |strict-match    |     4|exact_match|↑  |0.4223|±  |0.0136|

I used the same evaluation settings for both (the revision argument in parentheses is added only for the ingredient checkpoint):

lm_eval --model hf \
    --model_args pretrained=allenai/OLMo-2-0425-1B(,revision=stage2-ingredient1-step23852-tokens51B) \
    --tasks gsm8k_cot \
    --batch_size auto \
    --num_fewshot 4 \
    --trust_remote_code \
    --confirm_run_unsafe_code

Also, the description in allenai/OLMo claims that the released main ckpt is merged from a model soup, which differs from the description on the HF model page and in #1.
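One rough way to judge whether gaps like these exceed evaluation noise is to compare each difference with the reported standard errors. The sketch below is only an illustration and was not part of the original evaluation. It assumes the two runs can be treated as independent samples and uses a normal approximation; since both checkpoints are scored on the same test items, a paired test would be more appropriate. The helper names `z_score` and `two_sided_p` are hypothetical, and the numbers are copied from the MMLU and gsm8k tables above.

```python
import math


def z_score(acc_a, se_a, acc_b, se_b):
    """z statistic for the difference of two accuracies.

    Assumption: the two runs are independent, so their variances add.
    """
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


def two_sided_p(z):
    """Two-sided p-value of z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# (ingredient 1 score, stderr, released score, stderr), from the tables above
results = {
    "mmlu acc": (0.4417, 0.0041, 0.4257, 0.0041),
    "gsm8k_cot flexible-extract": (0.4594, 0.0137, 0.4079, 0.0135),
    "gsm8k_cot strict-match": (0.4223, 0.0136, 0.4003, 0.0135),
}

for name, (a, se_a, b, se_b) in results.items():
    z = z_score(a, se_a, b, se_b)
    print(f"{name}: diff={a - b:+.4f}  z={z:.2f}  p={two_sided_p(z):.4f}")
```

Under these assumptions, a |z| well above 2 suggests the gap is unlikely to come from sampling noise alone, while a value near 1 is consistent with noise.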

Hey @wydwww, thanks for raising this issue. I have cross-checked this with the team again.

  1. There is no model souping (there was a typo in the README file in the OLMo GitHub repo; I fixed it).
  2. My comment in #1 was wrong. Ingredient 3 (seed 42) is the final main checkpoint. Ingredients 1 and 2 are not; they are just exploratory anneals. I have corrected this in #1.
  3. To clear things up, I have updated the README.

Sorry for the inconvenience. You can retry the evals.

amanrangapur changed discussion status to closed

Thanks for your reply, @amanrangapur. I ran the gsm8k eval on stage2-ingredient3-step23852-tokens51B with the same command and still got a significantly higher result (0.4549) than the main ckpt (0.4079). FYI, the ingredient 2 ckpt scores 0.4556 in this setting. Did you apply any post-processing to get the final ckpt?

hf (pretrained=allenai/OLMo-2-0425-1B,revision=stage2-ingredient3-step23852-tokens51B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     4|exact_match|↑  |0.4549|±  |0.0137|
|         |       |strict-match    |     4|exact_match|↑  |0.4511|±  |0.0137|

Hey @wydwww, we did not apply any post-processing to the final checkpoint. We selected one of the ingredients (anneals) based on its average eval scores.

@amanrangapur It seems that the final ckpt does not match any of the 3 ingredient ckpts. Do you have any thoughts on this? Can you please verify that the main and stage2-ingredient3-step23852-tokens51B ckpts are the same in your setting? Thanks.
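One possible way to run such a check without an eval is to hash the weight files of the two revisions. The sketch below rests on several assumptions: both revisions have already been downloaded to local directories (for example with `huggingface_hub.snapshot_download` and its `revision` argument), the weights are stored as `*.safetensors` files, and the helper names `file_sha256` and `compare_weight_dirs` are hypothetical. Byte-identical files imply identical weights; a hash mismatch could also come from different metadata or serialization, so it would call for a tensor-level comparison rather than prove the weights differ.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight shards are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_weight_dirs(dir_a, dir_b, pattern="*.safetensors"):
    """Map each weight file name to True when both directories hold identical bytes.

    A file that exists in only one of the two directories maps to False.
    """
    files_a = {p.name: p for p in Path(dir_a).glob(pattern)}
    files_b = {p.name: p for p in Path(dir_b).glob(pattern)}
    report = {}
    for name in sorted(files_a.keys() | files_b.keys()):
        if name in files_a and name in files_b:
            report[name] = file_sha256(files_a[name]) == file_sha256(files_b[name])
        else:
            report[name] = False
    return report


# Hypothetical usage, with local_main and local_ingredient3 being the two
# download directories (these names are placeholders, not real paths):
#   report = compare_weight_dirs(local_main, local_ingredient3)
#   print(all(report.values()), report)
```

If every entry in the report is `True`, the two revisions hold the same weight files and the eval gap would have to come from elsewhere (tokenizer, config, or the eval setup).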
