Update README.md
README.md
@@ -20,36 +20,36 @@ Note: These results are with corrected parsing for BBH from Eleuther's [lm-evaluation-harness]
 
 Smaug-Qwen2-72B-Instruct:
 
-
-
-|bbh |N/A
-| - bbh_cot_fewshot_boolean_expressions
-| - bbh_cot_fewshot_causal_judgement
-| - bbh_cot_fewshot_date_understanding
-| - bbh_cot_fewshot_disambiguation_qa
-| - bbh_cot_fewshot_dyck_languages
-| - bbh_cot_fewshot_formal_fallacies
-| - bbh_cot_fewshot_geometric_shapes
-| - bbh_cot_fewshot_hyperbaton
-| - bbh_cot_fewshot_logical_deduction_five_objects
-| - bbh_cot_fewshot_logical_deduction_seven_objects
-| - bbh_cot_fewshot_logical_deduction_three_objects
-| - bbh_cot_fewshot_movie_recommendation
-| - bbh_cot_fewshot_multistep_arithmetic_two
-| - bbh_cot_fewshot_navigate
-| - bbh_cot_fewshot_object_counting
-| - bbh_cot_fewshot_penguins_in_a_table
-| - bbh_cot_fewshot_reasoning_about_colored_objects
-| - bbh_cot_fewshot_ruin_names
-| - bbh_cot_fewshot_salient_translation_error_detection
-| - bbh_cot_fewshot_snarks
-| - bbh_cot_fewshot_sports_understanding
-| - bbh_cot_fewshot_temporal_sequences
-| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects
-| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects|
-| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects|
-| - bbh_cot_fewshot_web_of_lies
-| - bbh_cot_fewshot_word_sorting
+| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
+|-----------------------------------------------------------|---------|------------|--------|-------------|--------|--------|
+| bbh | N/A | get-answer | 3 | exact_match | 0.8241 | 0.0042 |
+| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
+| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6578 | 0.0348 |
+| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
+| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8280 | 0.0239 |
+| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3360 | 0.0299 |
+| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7120 | 0.0287 |
+| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.5320 | 0.0316 |
+| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9880 | 0.0069 |
+| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.7680 | 0.0268 |
+| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.5360 | 0.0316 |
+| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
+| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
+| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
+| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
+| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
+| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.8493 | 0.0297 |
+| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.7560 | 0.0272 |
+| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8520 | 0.0225 |
+| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5920 | 0.0311 |
+| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.9101 | 0.0215 |
+| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
+| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
+| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9800 | 0.0089 |
+| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9560 | 0.0130 |
+| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
+| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
+| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6560 | 0.0301 |
 
 Qwen2-72B-Instruct:
 
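For context, BBH tables in this form are produced with Eleuther's lm-evaluation-harness. The sketch below is not part of this commit; the task group name (`bbh_cot_fewshot`), the Hugging Face repo id, and the exact `simple_evaluate` arguments are assumptions that vary by harness version.

```python
# Hypothetical sketch only: task/group names, the repo id, and argument values
# are assumptions and differ across lm-evaluation-harness versions.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=abacusai/Smaug-Qwen2-72B-Instruct,dtype=bfloat16",
    tasks=["bbh_cot_fewshot"],  # 3-shot CoT BBH, reported above under the "bbh" group
    batch_size="auto",
)

# Per-task exact_match scores (the "Value" and "Stderr" columns above).
print(json.dumps(results["results"], indent=2, default=str))
```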
@@ -96,9 +96,9 @@ Score vs selected others (sourced from: (https://lmsys.org/blog/2024-04-19-arena
 | Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
 | Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
 | GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
-| Smaug-Qwen2-72B-Instruct |
+| Smaug-Qwen2-72B-Instruct | 48.0 | (-1.8, 2.1) | 628 |
 | Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
-| Qwen2-72B-Instruct |
+| Qwen2-72B-Instruct | 43.5 | (-2.6, 2.7) | 531 |
 | Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
 | GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
 | Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
@@ -110,21 +110,26 @@ Score vs selected others (sourced from: (https://lmsys.org/blog/2024-04-19-arena
 
 ## MT-Bench
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+First turn
+
+| Model | Turn | Score |
+|--------------------------|------|---------|
+| Qwen2-72B-Instruct | 1 | 9.18125 |
+| Smaug-Qwen2-72B-Instruct | 1 | 9.05625 |
+
+Second turn
+
+| Model | Turn | Score |
+|--------------------------|------|---------|
+| Qwen2-72B-Instruct | 2 | 8.74684 |
+| Smaug-Qwen2-72B-Instruct | 2 | 8.67500 |
+
+Average
+
+| Model | Score |
+|--------------------------|---------|
+| Qwen2-72B-Instruct | 8.96541 |
+| Smaug-Qwen2-72B-Instruct | 8.86563 |
 
 
 # Model Card for Model ID
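A note on reading the MT-Bench tables above: the reported "Average" tracks the mean of the two per-turn scores closely but does not have to match it exactly, presumably because the average is taken over all individual judged turns. A quick sanity check using only the values from the tables:

```python
# Uses only the numbers from the tables above; the small gap for
# Qwen2-72B-Instruct is presumably due to averaging over individual judged
# turns rather than over the two per-turn means.
first_turn = {"Qwen2-72B-Instruct": 9.18125, "Smaug-Qwen2-72B-Instruct": 9.05625}
second_turn = {"Qwen2-72B-Instruct": 8.74684, "Smaug-Qwen2-72B-Instruct": 8.67500}
reported_avg = {"Qwen2-72B-Instruct": 8.96541, "Smaug-Qwen2-72B-Instruct": 8.86563}

for model in reported_avg:
    mean_of_turns = (first_turn[model] + second_turn[model]) / 2
    print(f"{model}: mean of turn scores ~ {mean_of_turns:.5f}, reported {reported_avg[model]}")
```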