abacusai
/

Smaug-Qwen2-72B-Instruct

@@ -11,10 +11,10 @@ Note: These results are with corrected parsing for BBH from Eleuther's [lm-evalu
 #### Overall:
-| Model                      | Groups | Version | Filter     | n-shot | Metric      |   | Value  |   | Stderr |
-|----------------------------|--------|---------|------------|--------|-------------|---|--------|---|--------|
-| Smaug-Qwen2-72B-Instruct   | bbh    | N/A     | get-answer | 3      | exact_match | ↑ | 0.8241 | ± | 0.0042 |
-| Qwen2-72B-Instruct         | bbh    | N/A     | get-answer | 3      | exact_match | ↑ | 0.8036 | ± | 0.0044 |
 #### Breakdown:
@@ -53,36 +53,79 @@ Smaug-Qwen2-72B-Instruct:
 Qwen2-72B-Instruct:
-|                          Tasks                           |Version|  Filter  |n-shot|  Metric   |   |Value |   |Stderr|
-|----------------------------------------------------------|-------|----------|-----:|-----------|---|-----:|---|-----:|
-|bbh                                                       |N/A    |get-answer|     3|exact_match|↑  |0.8036|±  |0.0044|
-| - bbh_cot_fewshot_boolean_expressions                    |      2|get-answer|     3|exact_match|↑  |0.9640|±  |0.0118|
-| - bbh_cot_fewshot_causal_judgement                       |      2|get-answer|     3|exact_match|↑  |0.6684|±  |0.0345|
-| - bbh_cot_fewshot_date_understanding                     |      2|get-answer|     3|exact_match|↑  |0.8000|±  |0.0253|
-| - bbh_cot_fewshot_disambiguation_qa                      |      2|get-answer|     3|exact_match|↑  |0.8360|±  |0.0235|
-| - bbh_cot_fewshot_dyck_languages                         |      2|get-answer|     3|exact_match|↑  |0.3040|±  |0.0292|
-| - bbh_cot_fewshot_formal_fallacies                       |      2|get-answer|     3|exact_match|↑  |0.7480|±  |0.0275|
-| - bbh_cot_fewshot_geometric_shapes                       |      2|get-answer|     3|exact_match|↑  |0.4960|±  |0.0317|
-| - bbh_cot_fewshot_hyperbaton                             |      2|get-answer|     3|exact_match|↑  |0.9440|±  |0.0146|
-| - bbh_cot_fewshot_logical_deduction_five_objects         |      2|get-answer|     3|exact_match|↑  |0.6800|±  |0.0296|
-| - bbh_cot_fewshot_logical_deduction_seven_objects        |      2|get-answer|     3|exact_match|↑  |0.4720|±  |0.0316|
-| - bbh_cot_fewshot_logical_deduction_three_objects        |      2|get-answer|     3|exact_match|↑  |0.9200|±  |0.0172|
-| - bbh_cot_fewshot_movie_recommendation                   |      2|get-answer|     3|exact_match|↑  |0.7800|±  |0.0263|
-| - bbh_cot_fewshot_multistep_arithmetic_two               |      2|get-answer|     3|exact_match|↑  |0.9760|±  |0.0097|
-| - bbh_cot_fewshot_navigate                               |      2|get-answer|     3|exact_match|↑  |0.9520|±  |0.0135|
-| - bbh_cot_fewshot_object_counting                        |      2|get-answer|     3|exact_match|↑  |0.9480|±  |0.0141|
-| - bbh_cot_fewshot_penguins_in_a_table                    |      2|get-answer|     3|exact_match|↑  |0.5753|±  |0.0410|
-| - bbh_cot_fewshot_reasoning_about_colored_objects        |      2|get-answer|     3|exact_match|↑  |0.8120|±  |0.0248|
-| - bbh_cot_fewshot_ruin_names                             |      2|get-answer|     3|exact_match|↑  |0.8760|±  |0.0209|
-| - bbh_cot_fewshot_salient_translation_error_detection    |      2|get-answer|     3|exact_match|↑  |0.5880|±  |0.0312|
-| - bbh_cot_fewshot_snarks                                 |      2|get-answer|     3|exact_match|↑  |0.8764|±  |0.0247|
-| - bbh_cot_fewshot_sports_understanding                   |      2|get-answer|     3|exact_match|↑  |0.9080|±  |0.0183|
-| - bbh_cot_fewshot_temporal_sequences                     |      2|get-answer|     3|exact_match|↑  |0.9960|±  |0.0040|
-| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects |      2|get-answer|     3|exact_match|↑  |0.9160|±  |0.0176|
-| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects|      2|get-answer|     3|exact_match|↑  |0.9400|±  |0.0151|
-| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects|      2|get-answer|     3|exact_match|↑  |0.9440|±  |0.0146|
-| - bbh_cot_fewshot_web_of_lies                            |      2|get-answer|     3|exact_match|↑  |1.0000|±  |0.0000|
-| - bbh_cot_fewshot_word_sorting                           |      2|get-answer|     3|exact_match|↑  |0.6680|±  |0.0298|
 # Model Card for Model ID
@@ -181,100 +224,6 @@ Use the code below to get started with the model.
 [More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 #### Overall:
+| Model                      | Groups | Version | Filter     | n-shot | Metric      | Value  |   | Stderr |
+|----------------------------|--------|---------|------------|--------|-------------|--------|---|--------|
+| Smaug-Qwen2-72B-Instruct   | bbh    | N/A     | get-answer | 3      | exact_match | 0.8241 | ± | 0.0042 |
+| Qwen2-72B-Instruct         | bbh    | N/A     | get-answer | 3      | exact_match | 0.8036 | ± | 0.0044 |
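As a rough reading aid for the table above, the sketch below estimates how large the gap between the two overall BBH scores is relative to the reported standard errors. It assumes the two errors can be treated as independent, which is only an approximation because both models were scored on the same questions. The helper name `z_for_difference` is hypothetical and is not part of lm-evaluation-harness.

```python
import math

# (exact_match, stderr) pairs copied from the "Overall" BBH table.
smaug = (0.8241, 0.0042)
base = (0.8036, 0.0044)

def z_for_difference(a, b):
    """Return (difference, combined stderr, z-score) for two (value, stderr) pairs.

    Assumption: the two standard errors are independent, so they combine
    in quadrature. A paired analysis would need per-question results.
    """
    diff = a[0] - b[0]
    se = math.hypot(a[1], b[1])  # sqrt(se_a**2 + se_b**2)
    return diff, se, diff / se

diff, se, z = z_for_difference(smaug, base)
print(f"difference={diff:.4f}  combined stderr={se:.4f}  z={z:.2f}")
```

Under that independence assumption, the difference of about 0.02 is a little over three combined standard errors.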
 #### Breakdown:
 Qwen2-72B-Instruct:
+| Tasks                                                     | Version | Filter     | n-shot | Metric      | Value  | Stderr |
+|-----------------------------------------------------------|---------|------------|--------|-------------|--------|--------|
+| bbh                                                       | N/A     | get-answer | 3      | exact_match | 0.8036 | 0.0044 |
+| - bbh_cot_fewshot_boolean_expressions                     | 2       | get-answer | 3      | exact_match | 0.9640 | 0.0118 |
+| - bbh_cot_fewshot_causal_judgement                        | 2       | get-answer | 3      | exact_match | 0.6684 | 0.0345 |
+| - bbh_cot_fewshot_date_understanding                      | 2       | get-answer | 3      | exact_match | 0.8000 | 0.0253 |
+| - bbh_cot_fewshot_disambiguation_qa                       | 2       | get-answer | 3      | exact_match | 0.8360 | 0.0235 |
+| - bbh_cot_fewshot_dyck_languages                          | 2       | get-answer | 3      | exact_match | 0.3040 | 0.0292 |
+| - bbh_cot_fewshot_formal_fallacies                        | 2       | get-answer | 3      | exact_match | 0.7480 | 0.0275 |
+| - bbh_cot_fewshot_geometric_shapes                        | 2       | get-answer | 3      | exact_match | 0.4960 | 0.0317 |
+| - bbh_cot_fewshot_hyperbaton                              | 2       | get-answer | 3      | exact_match | 0.9440 | 0.0146 |
+| - bbh_cot_fewshot_logical_deduction_five_objects          | 2       | get-answer | 3      | exact_match | 0.6800 | 0.0296 |
+| - bbh_cot_fewshot_logical_deduction_seven_objects         | 2       | get-answer | 3      | exact_match | 0.4720 | 0.0316 |
+| - bbh_cot_fewshot_logical_deduction_three_objects         | 2       | get-answer | 3      | exact_match | 0.9200 | 0.0172 |
+| - bbh_cot_fewshot_movie_recommendation                    | 2       | get-answer | 3      | exact_match | 0.7800 | 0.0263 |
+| - bbh_cot_fewshot_multistep_arithmetic_two                | 2       | get-answer | 3      | exact_match | 0.9760 | 0.0097 |
+| - bbh_cot_fewshot_navigate                                | 2       | get-answer | 3      | exact_match | 0.9520 | 0.0135 |
+| - bbh_cot_fewshot_object_counting                         | 2       | get-answer | 3      | exact_match | 0.9480 | 0.0141 |
+| - bbh_cot_fewshot_penguins_in_a_table                     | 2       | get-answer | 3      | exact_match | 0.5753 | 0.0410 |
+| - bbh_cot_fewshot_reasoning_about_colored_objects         | 2       | get-answer | 3      | exact_match | 0.8120 | 0.0248 |
+| - bbh_cot_fewshot_ruin_names                              | 2       | get-answer | 3      | exact_match | 0.8760 | 0.0209 |
+| - bbh_cot_fewshot_salient_translation_error_detection     | 2       | get-answer | 3      | exact_match | 0.5880 | 0.0312 |
+| - bbh_cot_fewshot_snarks                                  | 2       | get-answer | 3      | exact_match | 0.8764 | 0.0247 |
+| - bbh_cot_fewshot_sports_understanding                    | 2       | get-answer | 3      | exact_match | 0.9080 | 0.0183 |
+| - bbh_cot_fewshot_temporal_sequences                      | 2       | get-answer | 3      | exact_match | 0.9960 | 0.0040 |
+| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects  | 2       | get-answer | 3      | exact_match | 0.9160 | 0.0176 |
+| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2       | get-answer | 3      | exact_match | 0.9400 | 0.0151 |
+| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2       | get-answer | 3      | exact_match | 0.9440 | 0.0146 |
+| - bbh_cot_fewshot_web_of_lies                             | 2       | get-answer | 3      | exact_match | 1.0000 | 0.0000 |
+| - bbh_cot_fewshot_word_sorting                            | 2       | get-answer | 3      | exact_match | 0.6680 | 0.0298 |
+## Arena-Hard
+Scores compared with selected other models, sourced from the [Arena-Hard leaderboard](https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.
+| Model | Score | 95% Confidence Interval | Average Tokens |
+| :---- | ---------: | ----------: | ------: |
+| GPT-4-Turbo-2024-04-09 | 82.6  | (-1.8, 1.6)  | 662 |
+| GPT-4o | 78.3  | (-2.4, 2.1)  | 685 |
+| Gemini-1.5-pro-latest | 72.1  | (-2.3, 2.2)  | 630 |
+| Claude-3-Opus-20240229 | 60.4  | (-3.3, 2.4)  | 541 |
+| Smaug-Llama-3-70B-Instruct | 56.7  | (-2.2, 2.6)  | 661 |
+| GPT-4-0314 | 50.0  | (-0.0, 0.0)  | 423 |
+| Smaug-Qwen2-72B-Instruct | 48.0  | (-1.8, 2.1)  | 628 |
+| Claude-3-Sonnet-20240229 | 46.8  | (-2.1, 2.2)  | 552 |
+| Qwen2-72B-Instruct | 43.5  | (-2.6, 2.7)  | 531 |
+| Llama-3-70B-Instruct | 41.1  | (-2.5, 2.4)  | 583 |
+| GPT-4-0613 | 37.9  | (-2.2, 2.0)  | 354 |
+| Mistral-Large-2402 | 37.7 | (-1.9, 2.6)  | 400 |
+| Mixtral-8x22B-Instruct-v0.1 | 36.4  | (-2.7, 2.9)  | 430 |
+| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2)  | 474 |
+| Command-R-Plus | 33.1 | (-2.1, 2.2)  | 541 |
+| Mistral-Medium | 31.9  | (-2.3, 2.4)  | 485 |
+| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0)  | 401 |
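To make the confidence-interval column easier to compare, the sketch below converts two rows of the table into absolute bounds. It assumes the interval column holds signed offsets from the score, lower offset first; the `interval` helper is hypothetical and exists only for this illustration.

```python
# Rows copied from the Arena-Hard table: (score, lower offset, upper offset).
rows = {
    "Smaug-Qwen2-72B-Instruct": (48.0, -1.8, 2.1),
    "Qwen2-72B-Instruct": (43.5, -2.6, 2.7),
}

def interval(score, lo_off, hi_off):
    """Turn a score and its signed offsets into absolute (low, high) bounds.

    Assumption: offsets are relative to the score. Results are rounded to
    one decimal to match the table's precision.
    """
    return round(score + lo_off, 1), round(score + hi_off, 1)

bounds = {name: interval(*row) for name, row in rows.items()}
for name, (low, high) in bounds.items():
    print(f"{name}: [{low}, {high}]")
```

Read this way, the lower bound for Smaug-Qwen2-72B-Instruct and the upper bound for Qwen2-72B-Instruct meet at the same value, so the two intervals touch without clearly separating.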
+## MT-Bench
+| Model                    | First turn | Second turn | Average |
+| :----------------------- | ---------: | ----------: | ------: |
+| Qwen2-72B-Instruct       | 9.18125    | 8.74684     | 8.96541 |
+| Smaug-Qwen2-72B-Instruct | 9.05625    | 8.67500     | 8.86563 |
 # Model Card for Model ID
 [More Information Needed]