distilexp
some distillation experiments
This is a distillation experiment with SmolLM2-1.7B as the teacher and SmolLM2-360M as the student model.
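The exact training recipe is not documented here, so the following is only a minimal sketch of a standard logit-distillation setup for this teacher/student pair, assuming a temperature-softened KL term mixed with the usual next-token cross-entropy. The temperature, loss weight `alpha`, and example data are illustrative assumptions, not the values actually used for d-SmolLM2-360M.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# The SmolLM2 sizes share a tokenizer/vocabulary, so teacher and student
# logits can be compared token-for-token.
teacher = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B").eval()
student = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

def distill_loss(input_ids, attention_mask, temperature=2.0, alpha=0.5):
    """Blend soft-label KL loss (teacher vs. student) with hard-label CE loss.
    temperature and alpha are illustrative, not the values used for d-SmolLM2-360M."""
    with torch.no_grad():
        t_logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits
    s_out = student(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)

    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in Hinton et al. (2015).
    kl = F.kl_div(
        F.log_softmax(s_out.logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    # s_out.loss is the standard next-token cross-entropy on the hard labels.
    return alpha * kl + (1 - alpha) * s_out.loss

batch = tokenizer(
    ["Knowledge distillation transfers behaviour from a large model to a small one."],
    return_tensors="pt",
)
loss = distill_loss(batch["input_ids"], batch["attention_mask"])
loss.backward()
```

A real run would wrap this in a training loop over a distillation corpus; none of those details are given above.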
Eval results using the SmolLM evaluation scripts (LightEval): the distilled model gains slightly over the base model on a few tasks, by small margins.
| Task | Version | Metric | aloobun/d-SmolLM2-360M | HuggingFaceTB/SmolLM2-360M |
|---|---|---|---|---|
| all | | acc_norm | 0.4653 | 0.4642 |
| | | qem | 0.0961 | 0.1004 |
| custom:arc:_average:0 | | acc_norm | 0.5303 | 0.5305 |
| custom:arc:challenge:0 | 0 | acc_norm | 0.3771 | 0.3797 |
| custom:arc:easy:0 | 0 | acc_norm | 0.6835 | 0.6814 |
| custom:commonsense_qa:0 | 0 | acc_norm | 0.3784 | 0.3759 |
| custom:gsm8k:5 | 0 | qem | 0.0326 | 0.0334 |
| custom:hellaswag:0 | 0 | acc_norm | 0.5418 | 0.5456 |
| custom:mmlu_pro:0 | 0 | acc_norm | 0.1127 | 0.1130 |
| custom:openbook_qa:0 | 0 | acc_norm | 0.3760 | 0.3720 |
| custom:piqa:0 | 0 | acc_norm | 0.7214 | 0.7220 |
| custom:trivia_qa:0 | 0 | qem | 0.1596 | 0.1675 |
| custom:winogrande:0 | 0 | acc_norm | 0.5312 | 0.5241 |
Eval results using the lm-eval evaluation scripts:

It slightly improves upon the performance of the base model on the following tasks (a sketch of how such a run can be reproduced follows the table):
| Tasks | HuggingFaceTB/SmolLM2-360M | aloobun/d-SmolLM2-360M |
|---|---|---|
| leaderboard_bbh_causal_judgement | 0.4545 | 0.4652 |
| leaderboard_bbh_geometric_shapes | 0.1680 | 0.2040 |
| leaderboard_bbh_movie_recommendation | 0.2120 | 0.2440 |
| leaderboard_bbh_penguins_in_a_table | 0.2055 | 0.2123 |
| leaderboard_bbh_reasoning_about_colored_objects | 0.1160 | 0.1320 |
| leaderboard_bbh_ruin_names | 0.2360 | 0.2480 |
| leaderboard_bbh_salient_translation_error_detection | 0.1480 | 0.2120 |
| leaderboard_bbh_snarks | 0.5169 | 0.5281 |
| leaderboard_bbh_temporal_sequences | 0.2720 | 0.2800 |
| leaderboard_musr_murder_mysteries | 0.5040 | 0.5160 |
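Numbers like the lm-eval results in this card can be reproduced with the harness's Python API roughly as in the sketch below. The model string, dtype, and task list are assumptions for illustration, not the exact configuration that produced these tables.

```python
# Hedged sketch: scoring the distilled model on the leaderboard_* task groups
# with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=aloobun/d-SmolLM2-360M,dtype=bfloat16",
    tasks=["leaderboard_bbh", "leaderboard_musr", "leaderboard_gpqa"],
    batch_size="auto",
)

# Print per-task metrics (acc_norm, stderr, ...) as reported by the harness.
for task, metrics in results["results"].items():
    print(task, metrics)
```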
Well, it didn't work as well as I hoped; I'll try again.
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.2071 | ± | 0.0289 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.2308 | ± | 0.0180 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.2679 | ± | 0.0209 |
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard_musr | N/A | | | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5160 | ± | 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.2383 | ± | 0.0267 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.4400 | ± | 0.0315 |
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard_bbh | N/A | | | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.5480 | ± | 0.0315 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.4652 | ± | 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.1560 | ± | 0.0230 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.3120 | ± | 0.0294 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.5240 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.2040 | ± | 0.0255 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.5000 | ± | 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.2240 | ± | 0.0264 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1440 | ± | 0.0222 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3320 | ± | 0.0298 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.2440 | ± | 0.0272 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.5800 | ± | 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.2080 | ± | 0.0257 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.2123 | ± | 0.0340 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.1320 | ± | 0.0215 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.2480 | ± | 0.0274 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.2120 | ± | 0.0259 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.5281 | ± | 0.0375 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 0.2800 | ± | 0.0285 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.1720 | ± | 0.0239 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1440 | ± | 0.0222 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3000 | ± | 0.0290 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.5480 | ± | 0.0315 |
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.1173 | ± | 0.0029 |
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.2866 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.2770 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.1497 | ± | 0.0154 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.1423 | ± | 0.0150 |
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard_math_hard | N/A | | | | | | | |
| - leaderboard_math_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0033 | ± | 0.0033 |
| - leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match | ↑ | 0.0081 | ± | 0.0081 |
| - leaderboard_math_geometry_hard | 2 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match | ↑ | 0.0065 | ± | 0.0065 |
| - leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0104 | ± | 0.0073 |
| - leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
Detailed results can be found here.
| Metric | Value |
|---|---|
| Avg. | 6.01 |
| IFEval (0-Shot) | 20.97 |
| BBH (3-Shot) | 4.76 |
| MATH Lvl 5 (4-Shot) | 0.23 |
| GPQA (0-shot) | 0.45 |
| MuSR (0-shot) | 7.76 |
| MMLU-PRO (5-shot) | 1.88 |