Mismatch between SmolLM2-360M-intermediate-checkpoints and SmolLM2-360M performance

#9
by Tobi-r9 - opened

Hey, I have run some evaluations (using lm-evaluation-harness) to understand the training dynamics on some reasoning tasks. For that, I evaluated all checkpoints you provided at https://huggingface.co/HuggingFaceTB/SmolLM2-360M-intermediate-checkpoints. To my understanding, the final 2400k checkpoint from that repository should match the HuggingFaceTB/SmolLM2-360M model. However, I found a significant difference between the two (see plot below). Could you clarify whether these are the correct checkpoints, or whether HuggingFaceTB/SmolLM2-360M is a fine-tuned model?
(All evaluations were done with the same script.)
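For reference, the checkpoints were evaluated along these lines. This is a sketch, not the exact script: the task list, batch size, and the specific `revision` name are assumptions; the intermediate checkpoints are selected via the repository's revision/branch names.

```shell
# Evaluate an intermediate checkpoint (selected via --model_args revision=...)
# Task names and batch size here are illustrative, not the exact ones used.
lm_eval --model hf \
  --model_args pretrained=HuggingFaceTB/SmolLM2-360M-intermediate-checkpoints,revision=step-2400000 \
  --tasks hellaswag,arc_easy \
  --batch_size 16

# Evaluate the released model with the identical settings for comparison
lm_eval --model hf \
  --model_args pretrained=HuggingFaceTB/SmolLM2-360M \
  --tasks hellaswag,arc_easy \
  --batch_size 16
```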

[image.png: plot of evaluation scores across intermediate checkpoints vs. SmolLM2-360M]
