csabakecskemeti's activity

replied to their post 37 minutes ago

Here is the full result of the re-executed evaluation on deepseek-ai/DeepSeek-R1-Distill-Llama-8B with the suggested gen args.

[Image: mytable2.png (full results table)]

I see some marginal changes in the scores, but not much. If this holds, the original Llama 3.1 8B wins more tests than the DeepSeek R1 distill. I'm not sure what is going on; if anyone can run the eval, please share your results.
Again, I could be totally wrong here.
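
To make the "wins more tests" comparison concrete, here is a minimal sketch that counts per-task wins between two lm_eval results files. The file paths (including the baseline Llama 3.1 8B results file) are hypothetical placeholders, and it assumes the usual lm_eval output layout with a top-level "results" dict keyed by task name and metrics like "acc,none" / "acc_norm,none".

```python
import json

# Hypothetical paths: point these at your own lm_eval result files.
DISTILL_PATH = "results_deepseek_r1_distill_llama_8b.json"
BASELINE_PATH = "results_llama_3_1_8b.json"


def load_results(path):
    # Recent lm_eval result files keep per-task metrics under the top-level "results" key.
    with open(path) as f:
        return json.load(f)["results"]


def main():
    distill = load_results(DISTILL_PATH)
    baseline = load_results(BASELINE_PATH)
    wins = {"distill": 0, "baseline": 0, "tie": 0}

    for task in sorted(set(distill) & set(baseline)):
        # Pick the first numeric metric shared by both runs (e.g. "acc,none"),
        # skipping stderr entries and string fields like "alias".
        shared = [
            k for k in distill[task]
            if k in baseline[task]
            and isinstance(distill[task][k], (int, float))
            and not k.endswith("stderr,none")
        ]
        if not shared:
            continue
        metric = shared[0]
        d, b = distill[task][metric], baseline[task][metric]
        if d > b:
            wins["distill"] += 1
        elif b > d:
            wins["baseline"] += 1
        else:
            wins["tie"] += 1
        print(f"{task:55s} {metric:25s} distill={d:.4f} baseline={b:.4f}")

    print(wins)


if __name__ == "__main__":
    main()
```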

Full result data (results dated 2025-01-26):
https://github.com/csabakecskemeti/lm_eval_results/blob/main/deepseek-ai__DeepSeek-R1-Distill-Llama-8B/results_2025-01-26T22-29-00.931915.json

Eval command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag,leaderboard_gpqa,leaderboard_ifeval,leaderboard_math_hard,leaderboard_mmlu_pro,leaderboard_musr,leaderboard_bbh --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,do_sample=True
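
For reference, here is a minimal sketch of roughly the same run through the lm_eval Python API. It assumes `lm_eval.simple_evaluate` accepts the same model_args/gen_kwargs strings as the CLI flags, so treat the argument names as an approximation rather than a verified reproduction of the command above.

```python
# Minimal sketch, assuming lm_eval.simple_evaluate mirrors the CLI arguments.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype=float16",
    tasks=[
        "hellaswag",
        "leaderboard_gpqa",
        "leaderboard_ifeval",
        "leaderboard_math_hard",
        "leaderboard_mmlu_pro",
        "leaderboard_musr",
        "leaderboard_bbh",
    ],
    batch_size="auto:4",
    gen_kwargs="temperature=0.6,top_p=0.95,do_sample=True",
)

# Per-task metrics live under the "results" key of the returned dict.
print(json.dumps(results["results"], indent=2, default=str))
```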

Eval output:
hf (pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype=float16), gen_kwargs: (temperature=0.6,top_p=0.95,do_sample=True), limit: None, num_fewshot: None, batch_size: auto:4 (1,16,64,64)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc ↑ | 0.5559 | ± 0.0050 |
| | | none | 0 | acc_norm ↑ | 0.7436 | ± 0.0044 |
| leaderboard_bbh | N/A | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm ↑ | 0.8080 | ± 0.0250 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm ↑ | 0.5508 | ± 0.0365 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm ↑ | 0.4240 | ± 0.0313 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm ↑ | 0.2240 | ± 0.0264 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm ↑ | 0.5200 | ± 0.0317 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm ↑ | 0.2360 | ± 0.0269 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm ↑ | 0.4840 | ± 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm ↑ | 0.3240 | ± 0.0297 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.4200 | ± 0.0313 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm ↑ | 0.4040 | ± 0.0311 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm ↑ | 0.6880 | ± 0.0294 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm ↑ | 0.6240 | ± 0.0307 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm ↑ | 0.4040 | ± 0.0311 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm ↑ | 0.2945 | ± 0.0379 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm ↑ | 0.4120 | ± 0.0312 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm ↑ | 0.4600 | ± 0.0316 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm ↑ | 0.3440 | ± 0.0301 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm ↑ | 0.5112 | ± 0.0376 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm ↑ | 0.4880 | ± 0.0317 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm ↑ | 0.2080 | ± 0.0257 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm ↑ | 0.1800 | ± 0.0243 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm ↑ | 0.1040 | ± 0.0193 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm ↑ | 0.3400 | ± 0.0300 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm ↑ | 0.4880 | ± 0.0317 |
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm ↑ | 0.2879 | ± 0.0323 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm ↑ | 0.3004 | ± 0.0196 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm ↑ | 0.3036 | ± 0.0217 |
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc ↑ | 0.4556 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.4400 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.3087 | ± 0.0199 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.2957 | ± 0.0196 |
| leaderboard_math_hard | N/A | | | | | |
| - leaderboard_math_algebra_hard | 2 | none | 4 | exact_match ↑ | 0.4821 | ± 0.0286 |
| - leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match ↑ | 0.2033 | ± 0.0364 |
| - leaderboard_math_geometry_hard | 2 | none | 4 | exact_match ↑ | 0.2197 | ± 0.0362 |
| - leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match ↑ | 0.0750 | ± 0.0158 |
| - leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match ↑ | 0.4026 | ± 0.0396 |
| - leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match ↑ | 0.4508 | ± 0.0359 |
| - leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match ↑ | 0.0963 | ± 0.0255 |
| leaderboard_mmlu_pro | 0.1 | none | 5 | acc ↑ | 0.2741 | ± 0.0041 |
| leaderboard_musr | N/A | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm ↑ | 0.5200 | ± 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm ↑ | 0.3086 | ± 0.0289 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm ↑ | 0.3120 | ± 0.0294 |
replied to their post 1 day ago

I've rerun hellaswag with the suggested config; the results haven't improved:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc ↑ | 0.5559 | ± 0.0050 |
| | | none | 0 | acc_norm ↑ | 0.7436 | ± 0.0044 |

Eval command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,generate_until=64,do_sample=True

replied to their post 1 day ago

I'd missed this suggested configuration in the model card:
"For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1."

Thanks to @shb777 and @bin110 for pointing this out!
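
As a reminder of what that sampling protocol implies, here is a minimal sketch of the standard unbiased pass@k estimator (from the Codex/HumanEval paper) applied to 64 sampled responses per query; the per-query correct-sample counts below are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy example (hypothetical numbers): 64 responses sampled per query at
# temperature=0.6, top_p=0.95, as the model card suggests.
samples_per_query = 64
correct_counts = [10, 0, 64, 3]  # hypothetical correct-sample count per query

pass1_per_query = [pass_at_k(samples_per_query, c, k=1) for c in correct_counts]
print("pass@1 per query:", [round(p, 4) for p in pass1_per_query])
print("mean pass@1:", sum(pass1_per_query) / len(pass1_per_query))
```

For k=1 the estimator reduces to c/n, i.e. the fraction of the 64 sampled responses that are correct, averaged over queries.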
