Here is the full result of the re-executed evaluation on deepseek-ai/DeepSeek-R1-Distill-Llama-8B with the suggested gen args.
I see some marginal changes in the scores, but not much. If this holds, the original Llama 3.1 8B wins more tests than the DeepSeek-R1 distill. I'm not sure what is going on, so if anyone can run the eval themselves, please share your results.
Again, I could be totally wrong here.
Full result data (results dated 2025-01-26):
https://github.com/csabakecskemeti/lm_eval_results/blob/main/deepseek-ai__DeepSeek-R1-Distill-Llama-8B/results_2025-01-26T22-29-00.931915.json
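To compare these numbers against the original Llama 3.1 8B scores, it helps to dump the per-task metrics out of the results JSON. A minimal Python sketch, assuming the usual lm-eval-harness results layout (a top-level "results" dict keyed by task name); the local filename is just a hypothetical copy of the file linked above:

```python
import json

# Hypothetical local copy of the results file linked above.
RESULTS_PATH = "results_2025-01-26T22-29-00.931915.json"

with open(RESULTS_PATH) as f:
    data = json.load(f)

# lm-eval-harness typically stores per-task metrics under the top-level
# "results" key, e.g. {"hellaswag": {"acc,none": 0.5559, "acc_norm,none": 0.7436, ...}, ...}
for task, metrics in sorted(data.get("results", {}).items()):
    for name, value in metrics.items():
        if isinstance(value, float):
            print(f"{task:55s} {name:30s} {value:.4f}")
```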
Eval command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag,leaderboard_gpqa,leaderboard_ifeval,leaderboard_math_hard,leaderboard_mmlu_pro,leaderboard_musr,leaderboard_bbh --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,do_sample=True
Eval output:
hf (pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype=float16), gen_kwargs: (temperature=0.6,top_p=0.95,do_sample=True), limit: None, num_fewshot: None, batch_size: auto:4 (1,16,64,64)
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
hellaswag | 1 | none | 0 | acc | ↑ | 0.5559 | ± | 0.0050 |
| | | none | 0 | acc_norm | ↑ | 0.7436 | ± | 0.0044 |
leaderboard_bbh | N/A | |||||||
- leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.8080 | ± | 0.0250 |
- leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.5508 | ± | 0.0365 |
- leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4240 | ± | 0.0313 |
- leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.2240 | ± | 0.0264 |
- leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.5200 | ± | 0.0317 |
- leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.2360 | ± | 0.0269 |
- leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.4840 | ± | 0.0317 |
- leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.3240 | ± | 0.0297 |
- leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.4200 | ± | 0.0313 |
- leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.4040 | ± | 0.0311 |
- leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.6880 | ± | 0.0294 |
- leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.6240 | ± | 0.0307 |
- leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.4040 | ± | 0.0311 |
- leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.2945 | ± | 0.0379 |
- leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.4120 | ± | 0.0312 |
- leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
- leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.3440 | ± | 0.0301 |
- leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.5112 | ± | 0.0376 |
- leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
- leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 0.2080 | ± | 0.0257 |
- leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.1800 | ± | 0.0243 |
- leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1040 | ± | 0.0193 |
- leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3400 | ± | 0.0300 |
- leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
leaderboard_gpqa | N/A | |||||||
- leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.2879 | ± | 0.0323 |
- leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.3004 | ± | 0.0196 |
- leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.3036 | ± | 0.0217 |
leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.4556 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.4400 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.3087 | ± | 0.0199 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.2957 | ± | 0.0196 |
leaderboard_math_hard | N/A | |||||||
- leaderboard_math_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.4821 | ± | 0.0286 |
- leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match | ↑ | 0.2033 | ± | 0.0364 |
- leaderboard_math_geometry_hard | 2 | none | 4 | exact_match | ↑ | 0.2197 | ± | 0.0362 |
- leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0750 | ± | 0.0158 |
- leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match | ↑ | 0.4026 | ± | 0.0396 |
- leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match | ↑ | 0.4508 | ± | 0.0359 |
- leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match | ↑ | 0.0963 | ± | 0.0255 |
leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.2741 | ± | 0.0041 |
leaderboard_musr | N/A | |||||||
- leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5200 | ± | 0.0317 |
- leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.3086 | ± | 0.0289 |
- leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.3120 | ± | 0.0294 |
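For anyone who wants to try reproducing this, the harness can also be driven from Python instead of the CLI. A minimal sketch with the same model and generation args, assuming lm-evaluation-harness v0.4.x (keyword names can differ slightly between versions; a small limit is handy for a quick smoke test before a full run):

```python
import lm_eval

# Reproduce one leaderboard task with the same generation args as the CLI run above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,dtype=float16",
    tasks=["leaderboard_ifeval"],
    batch_size="auto:4",
    gen_kwargs="temperature=0.6,top_p=0.95,do_sample=True",
    limit=None,  # set e.g. limit=50 for a quick smoke test
)
print(results["results"])
```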
I've re-run hellaswag with the suggested config; the results haven't improved:
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
hellaswag | 1 | none | 0 | acc | ↑ | 0.5559 | ± | 0.0050 |
| | | none | 0 | acc_norm | ↑ | 0.7436 | ± | 0.0044 |
Command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,generate_until=64,do_sample=True
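Separately, as a sanity check outside the harness code path, the same sampling settings can be tried directly through transformers to see what the raw generations look like. A rough sketch; the prompt and max_new_tokens below are arbitrary choices for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Arbitrary test prompt, formatted with the model's chat template.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(
    prompt,
    do_sample=True,    # matches do_sample=True in gen_kwargs
    temperature=0.6,   # matches temperature=0.6
    top_p=0.95,        # matches top_p=0.95
    max_new_tokens=256,
)
print(tok.decode(out[0][prompt.shape[-1]:], skip_special_tokens=True))
```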
Thx, will try