This is a distillation experiment with SmolLM2-1.7B as the teacher model and SmolLM2-360M as the student.
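The card does not document the training recipe, so the sketch below is only an assumption-laden illustration of what logit-level distillation between these two checkpoints could look like: a temperature-scaled KL term between the teacher's and student's next-token distributions mixed with the usual cross-entropy loss. The temperature, mixing weight, and data pipeline are hypothetical, not the values used here.

```python
# Minimal logit-distillation sketch (illustrative only; not the exact recipe used for d-SmolLM2-360M).
# Assumptions: teacher = SmolLM2-1.7B, student = SmolLM2-360M, shared tokenizer,
# loss = alpha * CE(student, labels) + (1 - alpha) * T^2 * KL(teacher || student).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B", torch_dtype=torch.bfloat16).eval()
student = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

def distill_loss(input_ids, attention_mask, labels, T=2.0, alpha=0.5):
    # Teacher provides soft targets; no gradients flow through it.
    with torch.no_grad():
        t_logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits
    out = student(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    s_logits = out.logits
    # Soft-label term: KL between temperature-scaled teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: the usual next-token cross-entropy computed by the model.
    return alpha * out.loss + (1.0 - alpha) * kl
```

In practice the KL term would be averaged only over non-padding tokens and the student trained in a standard optimizer loop; both are omitted here for brevity.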

Eval results using the SmolLM evaluation scripts (LightEval):

The distilled model gains slightly over the base model on a few tasks, though only by small margins.

| Task | Version | Metric | aloobun/d-SmolLM2-360M | HuggingFaceTB/SmolLM2-360M |
|---|---|---|---|---|
| all | | acc_norm | 0.4653 | 0.4642 |
| | | qem | 0.0961 | 0.1004 |
| custom:arc:_average:0 | | acc_norm | 0.5303 | 0.5305 |
| custom:arc:challenge:0 | 0 | acc_norm | 0.3771 | 0.3797 |
| custom:arc:easy:0 | 0 | acc_norm | 0.6835 | 0.6814 |
| custom:commonsense_qa:0 | 0 | acc_norm | 0.3784 | 0.3759 |
| custom:gsm8k:5 | 0 | qem | 0.0326 | 0.0334 |
| custom:hellaswag:0 | 0 | acc_norm | 0.5418 | 0.5456 |
| custom:mmlu_pro:0 | 0 | acc_norm | 0.1127 | 0.1130 |
| custom:openbook_qa:0 | 0 | acc_norm | 0.3760 | 0.3720 |
| custom:piqa:0 | 0 | acc_norm | 0.7214 | 0.7220 |
| custom:trivia_qa:0 | 0 | qem | 0.1596 | 0.1675 |
| custom:winogrande:0 | 0 | acc_norm | 0.5312 | 0.5241 |

Eval results using lm-eval evaluation scripts:

It slightly improves on the base model on the following tasks (a minimal reproduction sketch follows the table):

| Task | HuggingFaceTB/SmolLM2-360M | aloobun/d-SmolLM2-360M |
|---|---|---|
| leaderboard_bbh_causal_judgement | 0.4545 | 0.4652 |
| leaderboard_bbh_geometric_shapes | 0.1680 | 0.2040 |
| leaderboard_bbh_movie_recommendation | 0.2120 | 0.2440 |
| leaderboard_bbh_penguins_in_a_table | 0.2055 | 0.2123 |
| leaderboard_bbh_reasoning_about_colored_objects | 0.1160 | 0.1320 |
| leaderboard_bbh_ruin_names | 0.2360 | 0.2480 |
| leaderboard_bbh_salient_translation_error_detection | 0.1480 | 0.2120 |
| leaderboard_bbh_snarks | 0.5169 | 0.5281 |
| leaderboard_bbh_temporal_sequences | 0.2720 | 0.2800 |
| leaderboard_musr_murder_mysteries | 0.5040 | 0.5160 |
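These numbers come from EleutherAI's lm-evaluation-harness (lm-eval); the exact command, harness version, and batch size are not stated in the card, so the following is only a minimal reproduction sketch using the harness's Python API, under those caveats:

```python
# Sketch: reproduce the leaderboard-style evals with lm-evaluation-harness (pip install lm-eval).
# The harness version, batch size, and device actually used for this card are not documented;
# these are illustrative defaults.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aloobun/d-SmolLM2-360M,dtype=bfloat16",
    tasks=["leaderboard_bbh", "leaderboard_gpqa", "leaderboard_musr"],  # groups reported below
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)
```

The remaining groups reported below (leaderboard_ifeval, leaderboard_math_hard, leaderboard_mmlu_pro) can be run the same way by extending the tasks list.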

Well, it didn't work as well as I had hoped; I'll try again.

Eval results for aloobun/d-SmolLM2-360M (WIP)

GPQA

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm ↑ | 0.2071 | ± 0.0289 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm ↑ | 0.2308 | ± 0.0180 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm ↑ | 0.2679 | ± 0.0209 |

MUSR

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_musr | N/A | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm ↑ | 0.5160 | ± 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm ↑ | 0.2383 | ± 0.0267 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm ↑ | 0.4400 | ± 0.0315 |

BBH

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_bbh | N/A | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm ↑ | 0.5480 | ± 0.0315 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm ↑ | 0.4652 | ± 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm ↑ | 0.1560 | ± 0.0230 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm ↑ | 0.3120 | ± 0.0294 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm ↑ | 0.5240 | ± 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm ↑ | 0.2040 | ± 0.0255 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm ↑ | 0.5000 | ± 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm ↑ | 0.2240 | ± 0.0264 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.1440 | ± 0.0222 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm ↑ | 0.3320 | ± 0.0298 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm ↑ | 0.2440 | ± 0.0272 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm ↑ | 0.5800 | ± 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm ↑ | 0.2080 | ± 0.0257 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm ↑ | 0.2123 | ± 0.0340 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm ↑ | 0.1320 | ± 0.0215 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm ↑ | 0.2480 | ± 0.0274 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm ↑ | 0.2120 | ± 0.0259 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm ↑ | 0.5281 | ± 0.0375 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm ↑ | 0.4600 | ± 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm ↑ | 0.2800 | ± 0.0285 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm ↑ | 0.1720 | ± 0.0239 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.1440 | ± 0.0222 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm ↑ | 0.3000 | ± 0.0290 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm ↑ | 0.5480 | ± 0.0315 |

MMLU_PRO

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_mmlu_pro | 0.1 | none | 5 | acc ↑ | 0.1173 | ± 0.0029 |

IFEVAL

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc ↑ | 0.2866 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.2770 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.1497 | ± 0.0154 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.1423 | ± 0.0150 |

MATH HARD

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_math_hard | N/A | | | | | |
| - leaderboard_math_algebra_hard | 2 | none | 4 | exact_match ↑ | 0.0033 | ± 0.0033 |
| - leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match ↑ | 0.0081 | ± 0.0081 |
| - leaderboard_math_geometry_hard | 2 | none | 4 | exact_match ↑ | 0.0000 | ± 0.0000 |
| - leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match ↑ | 0.0000 | ± 0.0000 |
| - leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match ↑ | 0.0065 | ± 0.0065 |
| - leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match ↑ | 0.0104 | ± 0.0073 |
| - leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match ↑ | 0.0000 | ± 0.0000 |

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 6.01 |
| IFEval (0-Shot) | 20.97 |
| BBH (3-Shot) | 4.76 |
| MATH Lvl 5 (4-Shot) | 0.23 |
| GPQA (0-shot) | 0.45 |
| MuSR (0-shot) | 7.76 |
| MMLU-PRO (5-shot) | 1.88 |
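For completeness, here is a minimal, generic sketch of loading the distilled checkpoint with transformers for plain text completion (this is a base model, not an instruct model); the prompt and generation settings are illustrative only, not recommendations from the card.

```python
# Minimal sketch: load and sample from the distilled checkpoint with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aloobun/d-SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Plain completion prompt; greedy decoding for reproducibility.
inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```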