csabakecskemeti posted an update 2 days ago
I've run the Open LLM Leaderboard evaluations plus hellaswag on deepseek-ai/DeepSeek-R1-Distill-Llama-8B and compared the results to meta-llama/Llama-3.1-8B-Instruct, and at first glance R1 does not beat Llama overall.

If anyone wants to double-check, the results are posted here:
https://github.com/csabakecskemeti/lm_eval_results
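If anyone wants to put the two models' numbers side by side, here is a minimal sketch that diffs two lm-eval results JSON files (assuming the usual `"results"` layout of recent lm-eval versions; the file paths below are placeholders for the JSONs in the repo above):

```python
import json

# Placeholder paths -- point these at the results JSONs from the repo linked above.
R1_PATH = "deepseek_r1_distill_llama_8b_results.json"
LLAMA_PATH = "llama_3.1_8b_instruct_results.json"

def load_results(path):
    # lm-eval (v0.4+) stores per-task metrics under the "results" key,
    # with metric names such as "acc,none" or "acc_norm,none".
    with open(path) as f:
        return json.load(f)["results"]

r1, llama = load_results(R1_PATH), load_results(LLAMA_PATH)

for task in sorted(set(r1) & set(llama)):
    for metric, v1 in r1[task].items():
        # Skip stderr entries and non-numeric fields such as "alias".
        if metric.endswith("_stderr,none") or not isinstance(v1, (int, float)):
            continue
        v2 = llama[task].get(metric)
        if isinstance(v2, (int, float)):
            print(f"{task:45s} {metric:25s} R1 {v1:.4f}  Llama {v2:.4f}  diff {v1 - v2:+.4f}")
```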

Did I make a mistake somewhere, or is (at least this distilled version) simply not better than the competition?

I'll run the same on the Qwen 7B distilled version too.

It looks like your config set the temperature to 0; it should be 0.6 according to the Usage Recommendations.


Thx, will try

Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent output


Thx, will try

I missed this suggested configuration in the model card:
"For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1."

Thanks to @shb777 and @bin110 for pointing this out!
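For reference, the pass@1 estimate described in that quote is just the average, over queries, of the fraction of the 64 sampled responses that are correct. A tiny illustration (not the authors' code; the function and variable names are made up):

```python
# pass@1 estimated from k sampled responses per query:
# for each query, take the fraction of the k samples judged correct,
# then average that fraction over all queries.
def estimate_pass_at_1(correct_counts, k=64):
    # correct_counts: for each query, how many of its k samples were correct.
    return sum(c / k for c in correct_counts) / len(correct_counts)

# Example: three queries with 40, 12 and 64 correct samples out of 64 each.
print(estimate_pass_at_1([40, 12, 64]))  # (40/64 + 12/64 + 64/64) / 3 ≈ 0.604
```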

I've rerun hellaswag with the suggested config; the results haven't improved:

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---|---|---|---|---|---|
|hellaswag|1|none|0|acc|↑|0.5559|±|0.0050|
| | |none|0|acc_norm|↑|0.7436|±|0.0044|

command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,generate_until=64,do_sample=True
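For anyone who prefers the Python API over the CLI, roughly the same run can be sketched like this (assuming lm-eval v0.4's `simple_evaluate`; the argument names mirror the CLI flags and may need adjusting to your installed version, and this sketch skips the accelerate/multi-GPU setup):

```python
import lm_eval

# Roughly equivalent to the CLI invocation above (single process, no accelerate);
# gen_kwargs mirrors the sampling settings suggested in the model card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,dtype=float16",
    tasks=["hellaswag"],
    batch_size="auto:4",
    gen_kwargs="temperature=0.6,top_p=0.95,do_sample=True",
)
print(results["results"]["hellaswag"])
```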

Here is the full result of the re-executed evaluation on deepseek-ai/DeepSeek-R1-Distill-Llama-8B with the suggested gen args.

[Image: mytable2.png]

I see some marginal changes in the scores, but not much. If this is correct, the original Llama 3.1 8B wins more tests than the DeepSeek R1 distill. I'm not sure what is going on; if anyone can run the eval, please share your results.
Again, I could be totally wrong here.

Full result data (results dated 2025-01-26):
https://github.com/csabakecskemeti/lm_eval_results/blob/main/deepseek-ai__DeepSeek-R1-Distill-Llama-8B/results_2025-01-26T22-29-00.931915.json

Eval command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag,leaderboard_gpqa,leaderboard_ifeval,leaderboard_math_hard,leaderboard_mmlu_pro,leaderboard_musr,leaderboard_bbh --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,do_sample=True

Eval output:
hf (pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype=float16), gen_kwargs: (temperature=0.6,top_p=0.95,do_sample=True), limit: None, num_fewshot: None, batch_size: auto:4 (1,16,64,64)

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---|---|---|---|---|---|
|hellaswag|1|none|0|acc|↑|0.5559|±|0.0050|
| | |none|0|acc_norm|↑|0.7436|±|0.0044|
|leaderboard_bbh|N/A| | | | | | | |
| - leaderboard_bbh_boolean_expressions|1|none|3|acc_norm|↑|0.8080|±|0.0250|
| - leaderboard_bbh_causal_judgement|1|none|3|acc_norm|↑|0.5508|±|0.0365|
| - leaderboard_bbh_date_understanding|1|none|3|acc_norm|↑|0.4240|±|0.0313|
| - leaderboard_bbh_disambiguation_qa|1|none|3|acc_norm|↑|0.2240|±|0.0264|
| - leaderboard_bbh_formal_fallacies|1|none|3|acc_norm|↑|0.5200|±|0.0317|
| - leaderboard_bbh_geometric_shapes|1|none|3|acc_norm|↑|0.2360|±|0.0269|
| - leaderboard_bbh_hyperbaton|1|none|3|acc_norm|↑|0.4840|±|0.0317|
| - leaderboard_bbh_logical_deduction_five_objects|1|none|3|acc_norm|↑|0.3240|±|0.0297|
| - leaderboard_bbh_logical_deduction_seven_objects|1|none|3|acc_norm|↑|0.4200|±|0.0313|
| - leaderboard_bbh_logical_deduction_three_objects|1|none|3|acc_norm|↑|0.4040|±|0.0311|
| - leaderboard_bbh_movie_recommendation|1|none|3|acc_norm|↑|0.6880|±|0.0294|
| - leaderboard_bbh_navigate|1|none|3|acc_norm|↑|0.6240|±|0.0307|
| - leaderboard_bbh_object_counting|1|none|3|acc_norm|↑|0.4040|±|0.0311|
| - leaderboard_bbh_penguins_in_a_table|1|none|3|acc_norm|↑|0.2945|±|0.0379|
| - leaderboard_bbh_reasoning_about_colored_objects|1|none|3|acc_norm|↑|0.4120|±|0.0312|
| - leaderboard_bbh_ruin_names|1|none|3|acc_norm|↑|0.4600|±|0.0316|
| - leaderboard_bbh_salient_translation_error_detection|1|none|3|acc_norm|↑|0.3440|±|0.0301|
| - leaderboard_bbh_snarks|1|none|3|acc_norm|↑|0.5112|±|0.0376|
| - leaderboard_bbh_sports_understanding|1|none|3|acc_norm|↑|0.4880|±|0.0317|
| - leaderboard_bbh_temporal_sequences|1|none|3|acc_norm|↑|0.2080|±|0.0257|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects|1|none|3|acc_norm|↑|0.1800|±|0.0243|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects|1|none|3|acc_norm|↑|0.1040|±|0.0193|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects|1|none|3|acc_norm|↑|0.3400|±|0.0300|
| - leaderboard_bbh_web_of_lies|1|none|3|acc_norm|↑|0.4880|±|0.0317|
|leaderboard_gpqa|N/A| | | | | | | |
| - leaderboard_gpqa_diamond|1|none|0|acc_norm|↑|0.2879|±|0.0323|
| - leaderboard_gpqa_extended|1|none|0|acc_norm|↑|0.3004|±|0.0196|
| - leaderboard_gpqa_main|1|none|0|acc_norm|↑|0.3036|±|0.0217|
|leaderboard_ifeval|3|none|0|inst_level_loose_acc|↑|0.4556|±|N/A|
| | |none|0|inst_level_strict_acc|↑|0.4400|±|N/A|
| | |none|0|prompt_level_loose_acc|↑|0.3087|±|0.0199|
| | |none|0|prompt_level_strict_acc|↑|0.2957|±|0.0196|
|leaderboard_math_hard|N/A| | | | | | | |
| - leaderboard_math_algebra_hard|2|none|4|exact_match|↑|0.4821|±|0.0286|
| - leaderboard_math_counting_and_prob_hard|2|none|4|exact_match|↑|0.2033|±|0.0364|
| - leaderboard_math_geometry_hard|2|none|4|exact_match|↑|0.2197|±|0.0362|
| - leaderboard_math_intermediate_algebra_hard|2|none|4|exact_match|↑|0.0750|±|0.0158|
| - leaderboard_math_num_theory_hard|2|none|4|exact_match|↑|0.4026|±|0.0396|
| - leaderboard_math_prealgebra_hard|2|none|4|exact_match|↑|0.4508|±|0.0359|
| - leaderboard_math_precalculus_hard|2|none|4|exact_match|↑|0.0963|±|0.0255|
|leaderboard_mmlu_pro|0.1|none|5|acc|↑|0.2741|±|0.0041|
|leaderboard_musr|N/A| | | | | | | |
| - leaderboard_musr_murder_mysteries|1|none|0|acc_norm|↑|0.5200|±|0.0317|
| - leaderboard_musr_object_placements|1|none|0|acc_norm|↑|0.3086|±|0.0289|
| - leaderboard_musr_team_allocation|1|none|0|acc_norm|↑|0.3120|±|0.0294|
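If it helps with eyeballing the comparison, the grouped tasks above can be collapsed into single numbers with a plain macro-average of the subtask scores. Note this is not the Open LLM Leaderboard's official normalization, just a rough summary; the keys follow the usual lm-eval results layout, and the grouping-by-prefix heuristic is my own:

```python
import json
from collections import defaultdict

# Point this at the results JSON linked above.
with open("results_2025-01-26T22-29-00.931915.json") as f:
    results = json.load(f)["results"]

groups = defaultdict(list)
for task, metrics in results.items():
    # Rough grouping by prefix, e.g. "leaderboard_bbh_navigate" -> "leaderboard_bbh".
    prefix = "_".join(task.split("_")[:2]) if task.startswith("leaderboard_") else task
    # Take the first score-like metric that is present for this task.
    for key in ("acc_norm,none", "exact_match,none", "prompt_level_strict_acc,none", "acc,none"):
        score = metrics.get(key)
        if isinstance(score, (int, float)):
            groups[prefix].append(score)
            break

for prefix, scores in sorted(groups.items()):
    print(f"{prefix:20s} macro-avg over {len(scores):2d} entries: {sum(scores)/len(scores):.4f}")
```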