ArkaAbacus committed
Commit 1c49031 · verified · 1 Parent(s): 15c84a2

Update README.md

Files changed (1)
  1. README.md +77 -128
README.md CHANGED
@@ -11,10 +11,10 @@ Note: These results are with corrected parsing for BBH from Eleuther's [lm-evalu
 
 #### Overall:
 
- | Model | Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
- |----------------------------|--------|---------|------------|--------|-------------|---|--------|---|--------|
- | Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | ↑ | 0.8241 | ± | 0.0042 |
- | Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | ↑ | 0.8036 | ± | 0.0044 |
 
 #### Breakdown:
 
@@ -53,36 +53,79 @@ Smaug-Qwen2-72B-Instruct:
 
 Qwen2-72B-Instruct:
 
- | Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
- |----------------------------------------------------------|-------|----------|-----:|-----------|---|-----:|---|-----:|
- |bbh |N/A |get-answer| 3|exact_match|↑ |0.8036|0.0044|
- | - bbh_cot_fewshot_boolean_expressions | 2|get-answer| 3|exact_match|↑ |0.9640|0.0118|
- | - bbh_cot_fewshot_causal_judgement | 2|get-answer| 3|exact_match|↑ |0.6684|0.0345|
- | - bbh_cot_fewshot_date_understanding | 2|get-answer| 3|exact_match|↑ |0.8000|0.0253|
- | - bbh_cot_fewshot_disambiguation_qa | 2|get-answer| 3|exact_match|↑ |0.8360|0.0235|
- | - bbh_cot_fewshot_dyck_languages | 2|get-answer| 3|exact_match|↑ |0.3040|0.0292|
- | - bbh_cot_fewshot_formal_fallacies | 2|get-answer| 3|exact_match|↑ |0.7480|0.0275|
- | - bbh_cot_fewshot_geometric_shapes | 2|get-answer| 3|exact_match|↑ |0.4960|0.0317|
- | - bbh_cot_fewshot_hyperbaton | 2|get-answer| 3|exact_match|↑ |0.9440|0.0146|
- | - bbh_cot_fewshot_logical_deduction_five_objects | 2|get-answer| 3|exact_match|↑ |0.6800|0.0296|
- | - bbh_cot_fewshot_logical_deduction_seven_objects | 2|get-answer| 3|exact_match|↑ |0.4720|0.0316|
- | - bbh_cot_fewshot_logical_deduction_three_objects | 2|get-answer| 3|exact_match|↑ |0.9200|0.0172|
- | - bbh_cot_fewshot_movie_recommendation | 2|get-answer| 3|exact_match|↑ |0.7800|0.0263|
- | - bbh_cot_fewshot_multistep_arithmetic_two | 2|get-answer| 3|exact_match|↑ |0.9760|0.0097|
- | - bbh_cot_fewshot_navigate | 2|get-answer| 3|exact_match|↑ |0.9520|0.0135|
- | - bbh_cot_fewshot_object_counting | 2|get-answer| 3|exact_match|↑ |0.9480|0.0141|
- | - bbh_cot_fewshot_penguins_in_a_table | 2|get-answer| 3|exact_match|↑ |0.5753|0.0410|
- | - bbh_cot_fewshot_reasoning_about_colored_objects | 2|get-answer| 3|exact_match|↑ |0.8120|0.0248|
- | - bbh_cot_fewshot_ruin_names | 2|get-answer| 3|exact_match|↑ |0.8760|0.0209|
- | - bbh_cot_fewshot_salient_translation_error_detection | 2|get-answer| 3|exact_match|↑ |0.5880|0.0312|
- | - bbh_cot_fewshot_snarks | 2|get-answer| 3|exact_match|↑ |0.8764|0.0247|
- | - bbh_cot_fewshot_sports_understanding | 2|get-answer| 3|exact_match|↑ |0.9080|0.0183|
- | - bbh_cot_fewshot_temporal_sequences | 2|get-answer| 3|exact_match|↑ |0.9960|0.0040|
- | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2|get-answer| 3|exact_match|↑ |0.9160|0.0176|
- | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 2|get-answer| 3|exact_match|↑ |0.9400|0.0151|
- | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 2|get-answer| 3|exact_match|↑ |0.9440|0.0146|
- | - bbh_cot_fewshot_web_of_lies | 2|get-answer| 3|exact_match|↑ |1.0000|0.0000|
- | - bbh_cot_fewshot_word_sorting | 2|get-answer| 3|exact_match|↑ |0.6680|0.0298|
 
 # Model Card for Model ID
 
@@ -181,100 +224,6 @@ Use the code below to get started with the model.
 
 [More Information Needed]
 
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
 
 #### Overall:
 
+ | Model | Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
+ |----------------------------|--------|---------|------------|--------|-------------|--------|---|--------|
+ | Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± | 0.0042 |
+ | Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± | 0.0044 |
 
 #### Breakdown:
 
 
 
 Qwen2-72B-Instruct:
 
+ | Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
+ |-----------------------------------------------------------|---------|------------|--------|-------------|--------|--------|
+ | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | 0.0044 |
+ | - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
+ | - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6684 | 0.0345 |
+ | - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
+ | - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
+ | - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3040 | 0.0292 |
+ | - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7480 | 0.0275 |
+ | - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.4960 | 0.0317 |
+ | - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
+ | - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.6800 | 0.0296 |
+ | - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.4720 | 0.0316 |
+ | - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
+ | - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.7800 | 0.0263 |
+ | - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9760 | 0.0097 |
+ | - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9520 | 0.0135 |
+ | - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9480 | 0.0141 |
+ | - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.5753 | 0.0410 |
+ | - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.8120 | 0.0248 |
+ | - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8760 | 0.0209 |
+ | - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5880 | 0.0312 |
+ | - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.8764 | 0.0247 |
+ | - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9080 | 0.0183 |
+ | - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 0.9960 | 0.0040 |
+ | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9160 | 0.0176 |
+ | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9400 | 0.0151 |
+ | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
+ | - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
+ | - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |
+
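+ The BBH numbers above come from EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with the corrected `get-answer` parsing. Below is a minimal reproduction sketch via the harness's Python API; the model id and argument spellings are assumptions that may differ across harness versions.
+
+ ```python
+ # Hedged sketch: re-running the 3-shot BBH evaluation with lm-evaluation-harness.
+ import lm_eval
+
+ results = lm_eval.simple_evaluate(
+     model="hf",  # Hugging Face transformers backend
+     model_args="pretrained=abacusai/Smaug-Qwen2-72B-Instruct,dtype=bfloat16",  # assumed repo id
+     tasks=["bbh"],     # group expanding to the bbh_cot_fewshot_* subtasks above
+     num_fewshot=3,     # matches the n-shot column
+     batch_size="auto",
+ )
+ print(results["results"])  # per-task exact_match under the get-answer filter
+ ```
+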
+ ## Arena-Hard
+
+ Scores vs. selected other models (sourced from https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, and we produced those numbers from a local run using the same methodology.
+
+ | Model | Score | 95% Confidence Interval | Average Tokens |
+ | :---- | ---------: | ----------: | ------: |
+ | GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
+ | GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
+ | Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
+ | Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
+ | Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
+ | GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
+ | Smaug-Qwen2-72B-Instruct | 48.0 | (-1.8, 2.1) | 628 |
+ | Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
+ | Qwen2-72B-Instruct | 43.5 | (-2.6, 2.7) | 531 |
+ | Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
+ | GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
+ | Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
+ | Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
+ | Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
+ | Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
+ | Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
+ | GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |
+
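+ The confidence-interval column reports offsets around the score rather than absolute bounds (note the (-0.0, 0.0) interval on the GPT-4-0314 baseline). A small illustrative snippet using the Smaug-Qwen2-72B-Instruct row:
+
+ ```python
+ # A 95% CI of (-1.8, 2.1) around a score of 48.0 spans roughly 46.2 to 50.1.
+ score, lo_off, hi_off = 48.0, -1.8, 2.1
+ print(f"95% CI: [{score + lo_off:.1f}, {score + hi_off:.1f}]")  # [46.2, 50.1]
+ ```
+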
+ ## MT-Bench
+
+ | Model | First Turn | Second Turn | Average |
+ |--------------------------|-----------:|------------:|--------:|
+ | Qwen2-72B-Instruct | 9.18125 | 8.74684 | 8.96541 |
+ | Smaug-Qwen2-72B-Instruct | 9.05625 | 8.67500 | 8.86563 |
+
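+ Summaries in this format are typically derived by averaging per-question judge scores by model and turn. A self-contained sketch with dummy scores (the values below are illustrative, not the real judgment data):
+
+ ```python
+ import pandas as pd
+
+ # Dummy per-question judge scores: two models, two questions, two turns each.
+ df = pd.DataFrame({
+     "model": ["Qwen2-72B-Instruct"] * 4 + ["Smaug-Qwen2-72B-Instruct"] * 4,
+     "turn":  [1, 1, 2, 2] * 2,
+     "score": [9.0, 9.4, 8.6, 8.9, 9.0, 9.1, 8.5, 8.8],
+ })
+ print(df.groupby(["model", "turn"])["score"].mean())  # per-turn means
+ print(df.groupby("model")["score"].mean())            # overall averages
+ ```
+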
 
 # Model Card for Model ID
 