MMLU 0-shot
#9 by vince62s - opened
Running this:
`lm_eval --model hf --model_args pretrained=tencent/Hunyuan-7B-Instruct --trust_remote_code --tasks mmlu --device cuda:0 --batch_size 8`
gives this:
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.4704|± |0.0042|
| - humanities | 2|none | |acc |↑ |0.4261|± |0.0071|
| - formal_logic | 1|none | 0|acc |↑ |0.4683|± |0.0446|
| - high_school_european_history | 1|none | 0|acc |↑ |0.6485|± |0.0373|
| - high_school_us_history | 1|none | 0|acc |↑ |0.5588|± |0.0348|
| - high_school_world_history | 1|none | 0|acc |↑ |0.6034|± |0.0318|
| - international_law | 1|none | 0|acc |↑ |0.5537|± |0.0454|
| - jurisprudence | 1|none | 0|acc |↑ |0.5463|± |0.0481|
| - logical_fallacies | 1|none | 0|acc |↑ |0.5031|± |0.0393|
| - moral_disputes | 1|none | 0|acc |↑ |0.4046|± |0.0264|
| - moral_scenarios | 1|none | 0|acc |↑ |0.3263|± |0.0157|
| - philosophy | 1|none | 0|acc |↑ |0.4695|± |0.0283|
| - prehistory | 1|none | 0|acc |↑ |0.4722|± |0.0278|
| - professional_law | 1|none | 0|acc |↑ |0.3748|± |0.0124|
| - world_religions | 1|none | 0|acc |↑ |0.3977|± |0.0375|
| - other | 2|none | |acc |↑ |0.4677|± |0.0089|
| - business_ethics | 1|none | 0|acc |↑ |0.4800|± |0.0502|
| - clinical_knowledge | 1|none | 0|acc |↑ |0.4943|± |0.0308|
| - college_medicine | 1|none | 0|acc |↑ |0.4740|± |0.0381|
| - global_facts | 1|none | 0|acc |↑ |0.2600|± |0.0441|
| - human_aging | 1|none | 0|acc |↑ |0.4484|± |0.0334|
| - management | 1|none | 0|acc |↑ |0.5146|± |0.0495|
| - marketing | 1|none | 0|acc |↑ |0.5769|± |0.0324|
| - medical_genetics | 1|none | 0|acc |↑ |0.4400|± |0.0499|
| - miscellaneous | 1|none | 0|acc |↑ |0.4751|± |0.0179|
| - nutrition | 1|none | 0|acc |↑ |0.5261|± |0.0286|
| - professional_accounting | 1|none | 0|acc |↑ |0.3936|± |0.0291|
| - professional_medicine | 1|none | 0|acc |↑ |0.4301|± |0.0301|
| - virology | 1|none | 0|acc |↑ |0.4398|± |0.0386|
| - social sciences | 2|none | |acc |↑ |0.4979|± |0.0089|
| - econometrics | 1|none | 0|acc |↑ |0.3596|± |0.0451|
| - high_school_geography | 1|none | 0|acc |↑ |0.4747|± |0.0356|
| - high_school_government_and_politics| 1|none | 0|acc |↑ |0.5596|± |0.0358|
| - high_school_macroeconomics | 1|none | 0|acc |↑ |0.4718|± |0.0253|
| - high_school_microeconomics | 1|none | 0|acc |↑ |0.4790|± |0.0324|
| - high_school_psychology | 1|none | 0|acc |↑ |0.5468|± |0.0213|
| - human_sexuality | 1|none | 0|acc |↑ |0.4962|± |0.0439|
| - professional_psychology | 1|none | 0|acc |↑ |0.4052|± |0.0199|
| - public_relations | 1|none | 0|acc |↑ |0.4636|± |0.0478|
| - security_studies | 1|none | 0|acc |↑ |0.6204|± |0.0311|
| - sociology | 1|none | 0|acc |↑ |0.6020|± |0.0346|
| - us_foreign_policy | 1|none | 0|acc |↑ |0.5600|± |0.0499|
| - stem | 2|none | |acc |↑ |0.5122|± |0.0088|
| - abstract_algebra | 1|none | 0|acc |↑ |0.4500|± |0.0500|
| - anatomy | 1|none | 0|acc |↑ |0.4667|± |0.0431|
| - astronomy | 1|none | 0|acc |↑ |0.5395|± |0.0406|
| - college_biology | 1|none | 0|acc |↑ |0.4931|± |0.0418|
| - college_chemistry | 1|none | 0|acc |↑ |0.4600|± |0.0501|
| - college_computer_science | 1|none | 0|acc |↑ |0.4600|± |0.0501|
| - college_mathematics | 1|none | 0|acc |↑ |0.4500|± |0.0500|
| - college_physics | 1|none | 0|acc |↑ |0.4706|± |0.0497|
| - computer_security | 1|none | 0|acc |↑ |0.5000|± |0.0503|
| - conceptual_physics | 1|none | 0|acc |↑ |0.4596|± |0.0326|
| - electrical_engineering | 1|none | 0|acc |↑ |0.4966|± |0.0417|
| - elementary_mathematics | 1|none | 0|acc |↑ |0.5476|± |0.0256|
| - high_school_biology | 1|none | 0|acc |↑ |0.6387|± |0.0273|
| - high_school_chemistry | 1|none | 0|acc |↑ |0.5764|± |0.0348|
| - high_school_computer_science | 1|none | 0|acc |↑ |0.6200|± |0.0488|
| - high_school_mathematics | 1|none | 0|acc |↑ |0.4333|± |0.0302|
| - high_school_physics | 1|none | 0|acc |↑ |0.5033|± |0.0408|
| - high_school_statistics | 1|none | 0|acc |↑ |0.5556|± |0.0339|
| - machine_learning | 1|none | 0|acc |↑ |0.3750|± |0.0460|
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.4704|± |0.0042|
| - humanities | 2|none | |acc |↑ |0.4261|± |0.0071|
| - other | 2|none | |acc |↑ |0.4677|± |0.0089|
| - social sciences| 2|none | |acc |↑ |0.4979|± |0.0089|
| - stem | 2|none | |acc |↑ |0.5122|± |0.0088|
I can't run it on the pretrained version because of the tokenizer.
But in any case, it will never reach 79.
How did you arrive at 79 for your pretrain MMLU score?
I'll run it with 5 shots.
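For reference, a 5-shot run with lm-evaluation-harness would presumably use the same invocation with the `--num_fewshot` flag added (a sketch; exact behavior may vary across harness versions):

```shell
# Same invocation as above, plus 5 in-context examples per prompt
# via the lm-evaluation-harness --num_fewshot flag.
lm_eval --model hf \
  --model_args pretrained=tencent/Hunyuan-7B-Instruct \
  --trust_remote_code \
  --tasks mmlu \
  --num_fewshot 5 \
  --device cuda:0 \
  --batch_size 8
```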
EDIT: with 5 shots:
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.5249|± |0.0041|
| - humanities | 2|none | |acc |↑ |0.4567|± |0.0071|
| - other | 2|none | |acc |↑ |0.5330|± |0.0088|
| - social sciences| 2|none | |acc |↑ |0.5866|± |0.0088|
| - stem | 2|none | |acc |↑ |0.5582|± |0.0087|
Still far from 79.
vince62s changed discussion title from MMLU to MMLU 0-shot