Here is the full result of the re-executed evaluation of deepseek-ai/DeepSeek-R1-Distill-Llama-8B with the suggested generation args.
I see some marginal changes in the scores, but not much. If this is correct, the original Llama 3.1 8B wins more tests than the DeepSeek R1 distill. I'm not sure what is going on. If anyone else can run the eval, please share your results.
Again, I could be totally wrong here.
Full result data (results dated 2025-01-26):
https://github.com/csabakecskemeti/lm_eval_results/blob/main/deepseek-ai__DeepSeek-R1-Distill-Llama-8B/results_2025-01-26T22-29-00.931915.json
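For anyone who wants to diff this run against a Llama 3.1 8B run, below is a minimal sketch that pulls the aggregated scores out of the results file linked above. It assumes the standard lm-evaluation-harness results layout (a top-level "results" dict keyed by task name, with metric keys like "acc_norm,none") and that the file is reachable via the raw.githubusercontent.com mirror of the blob URL; adjust the URL or load the JSON locally if that doesn't hold.

```python
# Sketch: print the headline metrics from the linked lm-eval results JSON,
# assuming the standard "results" layout; not an official lm-eval utility.
import json
import urllib.request

# Raw mirror of the GitHub blob URL above (assumed path conversion).
RAW_URL = (
    "https://raw.githubusercontent.com/csabakecskemeti/lm_eval_results/main/"
    "deepseek-ai__DeepSeek-R1-Distill-Llama-8B/results_2025-01-26T22-29-00.931915.json"
)

with urllib.request.urlopen(RAW_URL) as resp:
    data = json.load(resp)

# Print every reported metric (skipping stderr and alias entries) so the run
# can be compared side by side with a Llama 3.1 8B results file.
for task, metrics in sorted(data["results"].items()):
    for key, value in metrics.items():
        if "stderr" in key or key == "alias":
            continue
        if isinstance(value, (int, float)):
            print(f"{task:60s} {key:28s} {value:.4f}")
```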
Eval command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag,leaderboard_gpqa,leaderboard_ifeval,leaderboard_math_hard,leaderboard_mmlu_pro,leaderboard_musr,leaderboard_bbh --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,do_sample=True
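If it helps with reproducing, here is a rough Python-API equivalent of the command above. It is a sketch assuming lm-eval >= 0.4.x, where lm_eval.simple_evaluate accepts model_args and gen_kwargs as comma-separated strings and the CLI flags map onto the corresponding keyword arguments; it does not replace the accelerate launcher, so multi-GPU sharding is left to parallelize=True.

```python
# Sketch of the same eval via the lm-eval Python API (assumed lm-eval >= 0.4.x).
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,"
        "parallelize=True,dtype=float16"
    ),
    tasks=[
        "hellaswag",
        "leaderboard_gpqa",
        "leaderboard_ifeval",
        "leaderboard_math_hard",
        "leaderboard_mmlu_pro",
        "leaderboard_musr",
        "leaderboard_bbh",
    ],
    batch_size="auto:4",
    gen_kwargs="temperature=0.6,top_p=0.95,do_sample=True",
    log_samples=False,
)

# Keep only the aggregated scores; the CLI's --output_path writes roughly
# this plus the per-sample logs when --log_samples is set.
with open("eval_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```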
Eval output:
hf (pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype=float16), gen_kwargs: (temperature=0.6,top_p=0.95,do_sample=True), limit: None, num_fewshot: None, batch_size: auto:4 (1,16,64,64)
|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|:-----|------:|:-----|-----:|:-----|:-:|-----:|:-:|-----:|
|hellaswag|1|none|0|acc|↑|0.5559|±|0.0050|
| | |none|0|acc_norm|↑|0.7436|±|0.0044|
|leaderboard_bbh|N/A| | | | | | | |
| - leaderboard_bbh_boolean_expressions|1|none|3|acc_norm|↑|0.8080|±|0.0250|
| - leaderboard_bbh_causal_judgement|1|none|3|acc_norm|↑|0.5508|±|0.0365|
| - leaderboard_bbh_date_understanding|1|none|3|acc_norm|↑|0.4240|±|0.0313|
| - leaderboard_bbh_disambiguation_qa|1|none|3|acc_norm|↑|0.2240|±|0.0264|
| - leaderboard_bbh_formal_fallacies|1|none|3|acc_norm|↑|0.5200|±|0.0317|
| - leaderboard_bbh_geometric_shapes|1|none|3|acc_norm|↑|0.2360|±|0.0269|
| - leaderboard_bbh_hyperbaton|1|none|3|acc_norm|↑|0.4840|±|0.0317|
| - leaderboard_bbh_logical_deduction_five_objects|1|none|3|acc_norm|↑|0.3240|±|0.0297|
| - leaderboard_bbh_logical_deduction_seven_objects|1|none|3|acc_norm|↑|0.4200|±|0.0313|
| - leaderboard_bbh_logical_deduction_three_objects|1|none|3|acc_norm|↑|0.4040|±|0.0311|
| - leaderboard_bbh_movie_recommendation|1|none|3|acc_norm|↑|0.6880|±|0.0294|
| - leaderboard_bbh_navigate|1|none|3|acc_norm|↑|0.6240|±|0.0307|
| - leaderboard_bbh_object_counting|1|none|3|acc_norm|↑|0.4040|±|0.0311|
| - leaderboard_bbh_penguins_in_a_table|1|none|3|acc_norm|↑|0.2945|±|0.0379|
| - leaderboard_bbh_reasoning_about_colored_objects|1|none|3|acc_norm|↑|0.4120|±|0.0312|
| - leaderboard_bbh_ruin_names|1|none|3|acc_norm|↑|0.4600|±|0.0316|
| - leaderboard_bbh_salient_translation_error_detection|1|none|3|acc_norm|↑|0.3440|±|0.0301|
| - leaderboard_bbh_snarks|1|none|3|acc_norm|↑|0.5112|±|0.0376|
| - leaderboard_bbh_sports_understanding|1|none|3|acc_norm|↑|0.4880|±|0.0317|
| - leaderboard_bbh_temporal_sequences|1|none|3|acc_norm|↑|0.2080|±|0.0257|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects|1|none|3|acc_norm|↑|0.1800|±|0.0243|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects|1|none|3|acc_norm|↑|0.1040|±|0.0193|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects|1|none|3|acc_norm|↑|0.3400|±|0.0300|
| - leaderboard_bbh_web_of_lies|1|none|3|acc_norm|↑|0.4880|±|0.0317|
|leaderboard_gpqa|N/A| | | | | | | |
| - leaderboard_gpqa_diamond|1|none|0|acc_norm|↑|0.2879|±|0.0323|
| - leaderboard_gpqa_extended|1|none|0|acc_norm|↑|0.3004|±|0.0196|
| - leaderboard_gpqa_main|1|none|0|acc_norm|↑|0.3036|±|0.0217|
|leaderboard_ifeval|3|none|0|inst_level_loose_acc|↑|0.4556|±|N/A|
| | |none|0|inst_level_strict_acc|↑|0.4400|±|N/A|
| | |none|0|prompt_level_loose_acc|↑|0.3087|±|0.0199|
| | |none|0|prompt_level_strict_acc|↑|0.2957|±|0.0196|
|leaderboard_math_hard|N/A| | | | | | | |
| - leaderboard_math_algebra_hard|2|none|4|exact_match|↑|0.4821|±|0.0286|
| - leaderboard_math_counting_and_prob_hard|2|none|4|exact_match|↑|0.2033|±|0.0364|
| - leaderboard_math_geometry_hard|2|none|4|exact_match|↑|0.2197|±|0.0362|
| - leaderboard_math_intermediate_algebra_hard|2|none|4|exact_match|↑|0.0750|±|0.0158|
| - leaderboard_math_num_theory_hard|2|none|4|exact_match|↑|0.4026|±|0.0396|
| - leaderboard_math_prealgebra_hard|2|none|4|exact_match|↑|0.4508|±|0.0359|
| - leaderboard_math_precalculus_hard|2|none|4|exact_match|↑|0.0963|±|0.0255|
|leaderboard_mmlu_pro|0.1|none|5|acc|↑|0.2741|±|0.0041|
|leaderboard_musr|N/A| | | | | | | |
| - leaderboard_musr_murder_mysteries|1|none|0|acc_norm|↑|0.5200|±|0.0317|
| - leaderboard_musr_object_placements|1|none|0|acc_norm|↑|0.3086|±|0.0289|
| - leaderboard_musr_team_allocation|1|none|0|acc_norm|↑|0.3120|±|0.0294|