This is a distillation experiment with SmolLM2-1.7B as the teacher model and SmolLM2-360M as the student.
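The card does not document the training recipe, so the sketch below is only an assumption-laden illustration of what logit-level distillation between these two checkpoints could look like: a temperature-scaled KL term between the teacher's and student's next-token distributions mixed with the usual cross-entropy loss. The temperature, mixing weight, and data pipeline are hypothetical, not the values used here.

```python
# Minimal logit-distillation sketch (illustrative only; not the exact recipe used for d-SmolLM2-360M).
# Assumptions: teacher = SmolLM2-1.7B, student = SmolLM2-360M, shared tokenizer,
# loss = alpha * CE(student, labels) + (1 - alpha) * T^2 * KL(teacher || student).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B", torch_dtype=torch.bfloat16).eval()
student = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

def distill_loss(input_ids, attention_mask, labels, T=2.0, alpha=0.5):
    # Teacher provides soft targets; no gradients flow through it.
    with torch.no_grad():
        t_logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits
    out = student(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    s_logits = out.logits
    # Soft-label term: KL between temperature-scaled teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: the usual next-token cross-entropy computed by the model.
    return alpha * out.loss + (1.0 - alpha) * kl
```

In practice the KL term would be averaged only over non-padding tokens and the student trained in a standard optimizer loop; both are omitted here for brevity.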

Eval results using the SmolLM evaluation scripts (LightEval):

The distilled model gains slightly over the base model on a few tasks, though only by small margins.

| Task | Version | Metric | aloobun/d-SmolLM2-360M | HuggingFaceTB/SmolLM2-360M |
|---|---|---|---|---|
| all | | acc_norm | 0.4653 | 0.4642 |
| | | qem | 0.0961 | 0.1004 |
| custom:arc:_average:0 | | acc_norm | 0.5303 | 0.5305 |
| custom:arc:challenge:0 | 0 | acc_norm | 0.3771 | 0.3797 |
| custom:arc:easy:0 | 0 | acc_norm | 0.6835 | 0.6814 |
| custom:commonsense_qa:0 | 0 | acc_norm | 0.3784 | 0.3759 |
| custom:gsm8k:5 | 0 | qem | 0.0326 | 0.0334 |
| custom:hellaswag:0 | 0 | acc_norm | 0.5418 | 0.5456 |
| custom:mmlu_pro:0 | 0 | acc_norm | 0.1127 | 0.1130 |
| custom:openbook_qa:0 | 0 | acc_norm | 0.3760 | 0.3720 |
| custom:piqa:0 | 0 | acc_norm | 0.7214 | 0.7220 |
| custom:trivia_qa:0 | 0 | qem | 0.1596 | 0.1675 |
| custom:winogrande:0 | 0 | acc_norm | 0.5312 | 0.5241 |

Eval results using lm-eval evaluation scripts:

It slightly improves on the base model on the following tasks (a minimal reproduction sketch follows the table):

| Task | HuggingFaceTB/SmolLM2-360M | aloobun/d-SmolLM2-360M |
|---|---|---|
| leaderboard_bbh_causal_judgement | 0.4545 | 0.4652 |
| leaderboard_bbh_geometric_shapes | 0.1680 | 0.2040 |
| leaderboard_bbh_movie_recommendation | 0.2120 | 0.2440 |
| leaderboard_bbh_penguins_in_a_table | 0.2055 | 0.2123 |
| leaderboard_bbh_reasoning_about_colored_objects | 0.1160 | 0.1320 |
| leaderboard_bbh_ruin_names | 0.2360 | 0.2480 |
| leaderboard_bbh_salient_translation_error_detection | 0.1480 | 0.2120 |
| leaderboard_bbh_snarks | 0.5169 | 0.5281 |
| leaderboard_bbh_temporal_sequences | 0.2720 | 0.2800 |
| leaderboard_musr_murder_mysteries | 0.5040 | 0.5160 |
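These numbers come from EleutherAI's lm-evaluation-harness (lm-eval); the exact command, harness version, and batch size are not stated in the card, so the following is only a minimal reproduction sketch using the harness's Python API, under those caveats:

```python
# Sketch: reproduce the leaderboard-style evals with lm-evaluation-harness (pip install lm-eval).
# The harness version, batch size, and device actually used for this card are not documented;
# these are illustrative defaults.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aloobun/d-SmolLM2-360M,dtype=bfloat16",
    tasks=["leaderboard_bbh", "leaderboard_gpqa", "leaderboard_musr"],  # groups reported below
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)
```

The remaining groups reported below (leaderboard_ifeval, leaderboard_math_hard, leaderboard_mmlu_pro) can be run the same way by extending the tasks list.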

Well, it didn't work as well as I had hoped; I'll try again.

Eval results for aloobun/d-SmolLM2-360M (WIP)

GPQA

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm ↑ | 0.2071 | ± 0.0289 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm ↑ | 0.2308 | ± 0.0180 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm ↑ | 0.2679 | ± 0.0209 |

MUSR

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_musr | N/A | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm ↑ | 0.5160 | ± 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm ↑ | 0.2383 | ± 0.0267 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm ↑ | 0.4400 | ± 0.0315 |

BBH

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_bbh | N/A | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm ↑ | 0.5480 | ± 0.0315 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm ↑ | 0.4652 | ± 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm ↑ | 0.1560 | ± 0.0230 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm ↑ | 0.3120 | ± 0.0294 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm ↑ | 0.5240 | ± 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm ↑ | 0.2040 | ± 0.0255 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm ↑ | 0.5000 | ± 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm ↑ | 0.2240 | ± 0.0264 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.1440 | ± 0.0222 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm ↑ | 0.3320 | ± 0.0298 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm ↑ | 0.2440 | ± 0.0272 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm ↑ | 0.5800 | ± 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm ↑ | 0.2080 | ± 0.0257 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm ↑ | 0.2123 | ± 0.0340 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm ↑ | 0.1320 | ± 0.0215 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm ↑ | 0.2480 | ± 0.0274 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm ↑ | 0.2120 | ± 0.0259 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm ↑ | 0.5281 | ± 0.0375 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm ↑ | 0.4600 | ± 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm ↑ | 0.2800 | ± 0.0285 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm ↑ | 0.1720 | ± 0.0239 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.1440 | ± 0.0222 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm ↑ | 0.3000 | ± 0.0290 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm ↑ | 0.5480 | ± 0.0315 |

MMLU_PRO

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_mmlu_pro | 0.1 | none | 5 | acc ↑ | 0.1173 | ± 0.0029 |

IFEVAL

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc ↑ | 0.2866 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.2770 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.1497 | ± 0.0154 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.1423 | ± 0.0150 |

MATH HARD

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_math_hard | N/A | | | | | |
| - leaderboard_math_algebra_hard | 2 | none | 4 | exact_match ↑ | 0.0033 | ± 0.0033 |
| - leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match ↑ | 0.0081 | ± 0.0081 |
| - leaderboard_math_geometry_hard | 2 | none | 4 | exact_match ↑ | 0.0000 | ± 0.0000 |
| - leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match ↑ | 0.0000 | ± 0.0000 |
| - leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match ↑ | 0.0065 | ± 0.0065 |
| - leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match ↑ | 0.0104 | ± 0.0073 |
| - leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match ↑ | 0.0000 | ± 0.0000 |

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 6.01 |
| IFEval (0-Shot) | 20.97 |
| BBH (3-Shot) | 4.76 |
| MATH Lvl 5 (4-Shot) | 0.23 |
| GPQA (0-shot) | 0.45 |
| MuSR (0-shot) | 7.76 |
| MMLU-PRO (5-shot) | 1.88 |
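For completeness, here is a minimal, generic sketch of loading the distilled checkpoint with transformers for plain text completion (this is a base model, not an instruct model); the prompt and generation settings are illustrative only, not recommendations from the card.

```python
# Minimal sketch: load and sample from the distilled checkpoint with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aloobun/d-SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Plain completion prompt; greedy decoding for reproducibility.
inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```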