mrs83 committed on
Commit 087ae6a · verified · 1 Parent(s): ed0fc4c

Update README.md

Files changed (1)
  1. README.md +80 -1
README.md CHANGED
@@ -16,7 +16,7 @@ Kurtis E1.1 fine-tuned with [flower](https://flower.ai/)

  ## Eval Results

- Evaluation tasks were performed with the [LM Evaluation Harness] (https://github.com/EleutherAI/lm-evaluation-harness) on an NVIDIA A40.


  ### hellaswag
@@ -131,3 +131,82 @@ lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Ins
  | - other | 2|none | |acc |↑ |0.7087|± |0.0079|
  | - social sciences| 2|none | |acc |↑ |0.7618|± |0.0076|
  | - stem | 2|none | |acc |↑ |0.6070|± |0.0085|


  ## Eval Results

+ Evaluation tasks were performed with the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) on an NVIDIA A40.


  ### hellaswag
 
  | - other | 2|none | |acc |↑ |0.7087|± |0.0079|
  | - social sciences| 2|none | |acc |↑ |0.7618|± |0.0076|
  | - stem | 2|none | |acc |↑ |0.6070|± |0.0085|
+
+ ### mmlu (5-shot)
+
+ ```
+ lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks mmlu --device cuda:0 --batch_size 8 --num_fewshot 5
+ ```
+
+ | Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.6629|± |0.0038|
+ | - humanities | 2|none | |acc |↑ |0.5862|± |0.0067|
+ | - formal_logic | 1|none | 5|acc |↑ |0.4683|± |0.0446|
+ | - high_school_european_history | 1|none | 5|acc |↑ |0.7818|± |0.0323|
+ | - high_school_us_history | 1|none | 5|acc |↑ |0.8284|± |0.0265|
+ | - high_school_world_history | 1|none | 5|acc |↑ |0.8692|± |0.0219|
+ | - international_law | 1|none | 5|acc |↑ |0.7769|± |0.0380|
+ | - jurisprudence | 1|none | 5|acc |↑ |0.7963|± |0.0389|
+ | - logical_fallacies | 1|none | 5|acc |↑ |0.8098|± |0.0308|
+ | - moral_disputes | 1|none | 5|acc |↑ |0.7110|± |0.0244|
+ | - moral_scenarios | 1|none | 5|acc |↑ |0.3464|± |0.0159|
+ | - philosophy | 1|none | 5|acc |↑ |0.7042|± |0.0259|
+ | - prehistory | 1|none | 5|acc |↑ |0.7284|± |0.0247|
+ | - professional_law | 1|none | 5|acc |↑ |0.4759|± |0.0128|
+ | - world_religions | 1|none | 5|acc |↑ |0.8304|± |0.0288|
+ | - other | 2|none | |acc |↑ |0.7171|± |0.0078|
+ | - business_ethics | 1|none | 5|acc |↑ |0.7400|± |0.0441|
+ | - clinical_knowledge | 1|none | 5|acc |↑ |0.7321|± |0.0273|
+ | - college_medicine | 1|none | 5|acc |↑ |0.6647|± |0.0360|
+ | - global_facts | 1|none | 5|acc |↑ |0.4100|± |0.0494|
+ | - human_aging | 1|none | 5|acc |↑ |0.7220|± |0.0301|
+ | - management | 1|none | 5|acc |↑ |0.7864|± |0.0406|
+ | - marketing | 1|none | 5|acc |↑ |0.8889|± |0.0206|
+ | - medical_genetics | 1|none | 5|acc |↑ |0.7900|± |0.0409|
+ | - miscellaneous | 1|none | 5|acc |↑ |0.7957|± |0.0144|
+ | - nutrition | 1|none | 5|acc |↑ |0.7680|± |0.0242|
+ | - professional_accounting | 1|none | 5|acc |↑ |0.5532|± |0.0297|
+ | - professional_medicine | 1|none | 5|acc |↑ |0.6471|± |0.0290|
+ | - virology | 1|none | 5|acc |↑ |0.5120|± |0.0389|
+ | - social sciences | 2|none | |acc |↑ |0.7735|± |0.0075|
+ | - econometrics | 1|none | 5|acc |↑ |0.5877|± |0.0463|
+ | - high_school_geography | 1|none | 5|acc |↑ |0.7828|± |0.0294|
+ | - high_school_government_and_politics| 1|none | 5|acc |↑ |0.8756|± |0.0238|
+ | - high_school_macroeconomics | 1|none | 5|acc |↑ |0.7051|± |0.0231|
+ | - high_school_microeconomics | 1|none | 5|acc |↑ |0.7773|± |0.0270|
+ | - high_school_psychology | 1|none | 5|acc |↑ |0.8550|± |0.0151|
+ | - human_sexuality | 1|none | 5|acc |↑ |0.8092|± |0.0345|
+ | - professional_psychology | 1|none | 5|acc |↑ |0.7288|± |0.0180|
+ | - public_relations | 1|none | 5|acc |↑ |0.6909|± |0.0443|
+ | - security_studies | 1|none | 5|acc |↑ |0.7551|± |0.0275|
+ | - sociology | 1|none | 5|acc |↑ |0.8308|± |0.0265|
+ | - us_foreign_policy | 1|none | 5|acc |↑ |0.8300|± |0.0378|
+ | - stem | 2|none | |acc |↑ |0.6159|± |0.0084|
+ | - abstract_algebra | 1|none | 5|acc |↑ |0.5000|± |0.0503|
+ | - anatomy | 1|none | 5|acc |↑ |0.6222|± |0.0419|
+ | - astronomy | 1|none | 5|acc |↑ |0.7500|± |0.0352|
+ | - college_biology | 1|none | 5|acc |↑ |0.7083|± |0.0380|
+ | - college_chemistry | 1|none | 5|acc |↑ |0.4700|± |0.0502|
+ | - college_computer_science | 1|none | 5|acc |↑ |0.6200|± |0.0488|
+ | - college_mathematics | 1|none | 5|acc |↑ |0.4000|± |0.0492|
+ | - college_physics | 1|none | 5|acc |↑ |0.4902|± |0.0497|
+ | - computer_security | 1|none | 5|acc |↑ |0.8200|± |0.0386|
+ | - conceptual_physics | 1|none | 5|acc |↑ |0.6383|± |0.0314|
+ | - electrical_engineering | 1|none | 5|acc |↑ |0.6483|± |0.0398|
+ | - elementary_mathematics | 1|none | 5|acc |↑ |0.5820|± |0.0254|
+ | - high_school_biology | 1|none | 5|acc |↑ |0.8161|± |0.0220|
+ | - high_school_chemistry | 1|none | 5|acc |↑ |0.6059|± |0.0344|
+ | - high_school_computer_science | 1|none | 5|acc |↑ |0.7500|± |0.0435|
+ | - high_school_mathematics | 1|none | 5|acc |↑ |0.4926|± |0.0305|
+ | - high_school_physics | 1|none | 5|acc |↑ |0.4702|± |0.0408|
+ | - high_school_statistics | 1|none | 5|acc |↑ |0.6343|± |0.0328|
+ | - machine_learning | 1|none | 5|acc |↑ |0.4911|± |0.0475|
+
+ | Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |------------------|------:|------|------|------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.6629|± |0.0038|
+ | - humanities | 2|none | |acc |↑ |0.5862|± |0.0067|
+ | - other | 2|none | |acc |↑ |0.7171|± |0.0078|
+ | - social sciences| 2|none | |acc |↑ |0.7735|± |0.0075|
+ | - stem | 2|none | |acc |↑ |0.6159|± |0.0084|
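
Reading note on the added Groups table (not part of the commit): the `Value` and `Stderr` columns can be turned into approximate intervals to judge how far apart two groups really are. The sketch below is a minimal illustration, assuming a normal approximation with z = 1.96 for a 95% interval. That choice and the helper name `interval` are assumptions for this example, not something the harness output states. The numbers are copied from the 5-shot Groups table.

```python
# (acc, stderr) pairs copied from the 5-shot mmlu Groups table in this diff.
groups = {
    "mmlu": (0.6629, 0.0038),
    "humanities": (0.5862, 0.0067),
    "other": (0.7171, 0.0078),
    "social sciences": (0.7735, 0.0075),
    "stem": (0.6159, 0.0084),
}

Z_95 = 1.96  # assumption: normal approximation for a 95% interval


def interval(acc, stderr, z=Z_95):
    """Return the (low, high) bounds of acc +/- z * stderr."""
    half_width = z * stderr
    return acc - half_width, acc + half_width


# Hypothetical helper output: one interval per group.
intervals = {name: interval(acc, se) for name, (acc, se) in groups.items()}

for name, (low, high) in intervals.items():
    print(f"{name:16s} {low:.4f} .. {high:.4f}")
```

Under these assumptions, two groups whose intervals do not overlap (for example `stem` and `social sciences`) differ by more than the reported sampling error alone would explain.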