csabakecskemeti posted an update 2 days ago
I've run the Open LLM Leaderboard evaluations plus hellaswag on deepseek-ai/DeepSeek-R1-Distill-Llama-8B and compared the results to meta-llama/Llama-3.1-8B-Instruct, and at first glance R1 does not beat Llama overall.

If anyone wants to double-check, the results are posted here:
https://github.com/csabakecskemeti/lm_eval_results
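If anyone wants to put the two models' numbers side by side, here is a minimal sketch that diffs two lm-eval results JSON files (assuming the usual `"results"` layout of recent lm-eval versions; the file paths below are placeholders for the JSONs in the repo above):

```python
import json

# Placeholder paths -- point these at the results JSONs from the repo linked above.
R1_PATH = "deepseek_r1_distill_llama_8b_results.json"
LLAMA_PATH = "llama_3.1_8b_instruct_results.json"

def load_results(path):
    # lm-eval (v0.4+) stores per-task metrics under the "results" key,
    # with metric names such as "acc,none" or "acc_norm,none".
    with open(path) as f:
        return json.load(f)["results"]

r1, llama = load_results(R1_PATH), load_results(LLAMA_PATH)

for task in sorted(set(r1) & set(llama)):
    for metric, v1 in r1[task].items():
        # Skip stderr entries and non-numeric fields such as "alias".
        if metric.endswith("_stderr,none") or not isinstance(v1, (int, float)):
            continue
        v2 = llama[task].get(metric)
        if isinstance(v2, (int, float)):
            print(f"{task:45s} {metric:25s} R1 {v1:.4f}  Llama {v2:.4f}  diff {v1 - v2:+.4f}")
```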

Did I make a mistake somewhere, or is (at least this distilled version) simply not better than the competition?

I'll run the same on the Qwen 7B distilled version too.

It looks like your config set the temperature to 0; it should be 0.6 according to the Usage Recommendations.


Thx, will try

Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent output


Thx, will try

I missed this suggested configuration in the model card:
"For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1."

Thanks to @shb777 and @bin110 for pointing this out!
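For reference, the pass@1 estimate described in that quote is just the average, over queries, of the fraction of the 64 sampled responses that are correct. A tiny illustration (not the authors' code; the function and variable names are made up):

```python
# pass@1 estimated from k sampled responses per query:
# for each query, take the fraction of the k samples judged correct,
# then average that fraction over all queries.
def estimate_pass_at_1(correct_counts, k=64):
    # correct_counts: for each query, how many of its k samples were correct.
    return sum(c / k for c in correct_counts) / len(correct_counts)

# Example: three queries with 40, 12 and 64 correct samples out of 64 each.
print(estimate_pass_at_1([40, 12, 64]))  # (40/64 + 12/64 + 64/64) / 3 ≈ 0.604
```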

I've rerun hellaswag with the suggested config; the results haven't improved:

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---|---|---|---|---|---|
|hellaswag|1|none|0|acc|↑|0.5559|±|0.0050|
| | |none|0|acc_norm|↑|0.7436|±|0.0044|

command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,generate_until=64,do_sample=True
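For anyone who prefers the Python API over the CLI, roughly the same run can be sketched like this (assuming lm-eval v0.4's `simple_evaluate`; the argument names mirror the CLI flags and may need adjusting to your installed version, and this sketch skips the accelerate/multi-GPU setup):

```python
import lm_eval

# Roughly equivalent to the CLI invocation above (single process, no accelerate);
# gen_kwargs mirrors the sampling settings suggested in the model card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,dtype=float16",
    tasks=["hellaswag"],
    batch_size="auto:4",
    gen_kwargs="temperature=0.6,top_p=0.95,do_sample=True",
)
print(results["results"]["hellaswag"])
```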

Here is the full result of the re-executed evaluation on deepseek-ai/DeepSeek-R1-Distill-Llama-8B with the suggested gen args.

[Image: mytable2.png]

I see some marginal changes in the scores, but not much. If this is correct, the original Llama 3.1 8B wins more tests than the DeepSeek R1 distill. I'm not sure what is going on; if anyone can run the eval, please share your results.
Again, I could be totally wrong here.

Full result data (results dated 2025-01-26):
https://github.com/csabakecskemeti/lm_eval_results/blob/main/deepseek-ai__DeepSeek-R1-Distill-Llama-8B/results_2025-01-26T22-29-00.931915.json

Eval command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag,leaderboard_gpqa,leaderboard_ifeval,leaderboard_math_hard,leaderboard_mmlu_pro,leaderboard_musr,leaderboard_bbh --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,do_sample=True

Eval output:
hf (pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype=float16), gen_kwargs: (temperature=0.6,top_p=0.95,do_sample=True), limit: None, num_fewshot: None, batch_size: auto:4 (1,16,64,64)

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---|---|---|---|---|---|
|hellaswag|1|none|0|acc|↑|0.5559|±|0.0050|
| | |none|0|acc_norm|↑|0.7436|±|0.0044|
|leaderboard_bbh|N/A| | | | | | | |
| - leaderboard_bbh_boolean_expressions|1|none|3|acc_norm|↑|0.8080|±|0.0250|
| - leaderboard_bbh_causal_judgement|1|none|3|acc_norm|↑|0.5508|±|0.0365|
| - leaderboard_bbh_date_understanding|1|none|3|acc_norm|↑|0.4240|±|0.0313|
| - leaderboard_bbh_disambiguation_qa|1|none|3|acc_norm|↑|0.2240|±|0.0264|
| - leaderboard_bbh_formal_fallacies|1|none|3|acc_norm|↑|0.5200|±|0.0317|
| - leaderboard_bbh_geometric_shapes|1|none|3|acc_norm|↑|0.2360|±|0.0269|
| - leaderboard_bbh_hyperbaton|1|none|3|acc_norm|↑|0.4840|±|0.0317|
| - leaderboard_bbh_logical_deduction_five_objects|1|none|3|acc_norm|↑|0.3240|±|0.0297|
| - leaderboard_bbh_logical_deduction_seven_objects|1|none|3|acc_norm|↑|0.4200|±|0.0313|
| - leaderboard_bbh_logical_deduction_three_objects|1|none|3|acc_norm|↑|0.4040|±|0.0311|
| - leaderboard_bbh_movie_recommendation|1|none|3|acc_norm|↑|0.6880|±|0.0294|
| - leaderboard_bbh_navigate|1|none|3|acc_norm|↑|0.6240|±|0.0307|
| - leaderboard_bbh_object_counting|1|none|3|acc_norm|↑|0.4040|±|0.0311|
| - leaderboard_bbh_penguins_in_a_table|1|none|3|acc_norm|↑|0.2945|±|0.0379|
| - leaderboard_bbh_reasoning_about_colored_objects|1|none|3|acc_norm|↑|0.4120|±|0.0312|
| - leaderboard_bbh_ruin_names|1|none|3|acc_norm|↑|0.4600|±|0.0316|
| - leaderboard_bbh_salient_translation_error_detection|1|none|3|acc_norm|↑|0.3440|±|0.0301|
| - leaderboard_bbh_snarks|1|none|3|acc_norm|↑|0.5112|±|0.0376|
| - leaderboard_bbh_sports_understanding|1|none|3|acc_norm|↑|0.4880|±|0.0317|
| - leaderboard_bbh_temporal_sequences|1|none|3|acc_norm|↑|0.2080|±|0.0257|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects|1|none|3|acc_norm|↑|0.1800|±|0.0243|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects|1|none|3|acc_norm|↑|0.1040|±|0.0193|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects|1|none|3|acc_norm|↑|0.3400|±|0.0300|
| - leaderboard_bbh_web_of_lies|1|none|3|acc_norm|↑|0.4880|±|0.0317|
|leaderboard_gpqa|N/A| | | | | | | |
| - leaderboard_gpqa_diamond|1|none|0|acc_norm|↑|0.2879|±|0.0323|
| - leaderboard_gpqa_extended|1|none|0|acc_norm|↑|0.3004|±|0.0196|
| - leaderboard_gpqa_main|1|none|0|acc_norm|↑|0.3036|±|0.0217|
|leaderboard_ifeval|3|none|0|inst_level_loose_acc|↑|0.4556|±|N/A|
| | |none|0|inst_level_strict_acc|↑|0.4400|±|N/A|
| | |none|0|prompt_level_loose_acc|↑|0.3087|±|0.0199|
| | |none|0|prompt_level_strict_acc|↑|0.2957|±|0.0196|
|leaderboard_math_hard|N/A| | | | | | | |
| - leaderboard_math_algebra_hard|2|none|4|exact_match|↑|0.4821|±|0.0286|
| - leaderboard_math_counting_and_prob_hard|2|none|4|exact_match|↑|0.2033|±|0.0364|
| - leaderboard_math_geometry_hard|2|none|4|exact_match|↑|0.2197|±|0.0362|
| - leaderboard_math_intermediate_algebra_hard|2|none|4|exact_match|↑|0.0750|±|0.0158|
| - leaderboard_math_num_theory_hard|2|none|4|exact_match|↑|0.4026|±|0.0396|
| - leaderboard_math_prealgebra_hard|2|none|4|exact_match|↑|0.4508|±|0.0359|
| - leaderboard_math_precalculus_hard|2|none|4|exact_match|↑|0.0963|±|0.0255|
|leaderboard_mmlu_pro|0.1|none|5|acc|↑|0.2741|±|0.0041|
|leaderboard_musr|N/A| | | | | | | |
| - leaderboard_musr_murder_mysteries|1|none|0|acc_norm|↑|0.5200|±|0.0317|
| - leaderboard_musr_object_placements|1|none|0|acc_norm|↑|0.3086|±|0.0289|
| - leaderboard_musr_team_allocation|1|none|0|acc_norm|↑|0.3120|±|0.0294|
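If it helps with eyeballing the comparison, the grouped tasks above can be collapsed into single numbers with a plain macro-average of the subtask scores. Note this is not the Open LLM Leaderboard's official normalization, just a rough summary; the keys follow the usual lm-eval results layout, and the grouping-by-prefix heuristic is my own:

```python
import json
from collections import defaultdict

# Point this at the results JSON linked above.
with open("results_2025-01-26T22-29-00.931915.json") as f:
    results = json.load(f)["results"]

groups = defaultdict(list)
for task, metrics in results.items():
    # Rough grouping by prefix, e.g. "leaderboard_bbh_navigate" -> "leaderboard_bbh".
    prefix = "_".join(task.split("_")[:2]) if task.startswith("leaderboard_") else task
    # Take the first score-like metric that is present for this task.
    for key in ("acc_norm,none", "exact_match,none", "prompt_level_strict_acc,none", "acc,none"):
        score = metrics.get(key)
        if isinstance(score, (int, float)):
            groups[prefix].append(score)
            break

for prefix, scores in sorted(groups.items()):
    print(f"{prefix:20s} macro-avg over {len(scores):2d} entries: {sum(scores)/len(scores):.4f}")
```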