Update README.md. Added Korean benchmarks. (#38)

Browse files

- Update README.md. Added Korean benchmarks. (a45ce929f2d146c8d07a1020011d4b25028d734d)

Co-authored-by: Daekeun Kim <[email protected]>

Files changed (1) hide show

README.md +97 -1

README.md CHANGED Viewed

@@ -276,4 +276,100 @@ Note that by default, the Phi-3.5-MoE-instruct model uses flash attention, which
 The model is licensed under the [MIT license](./LICENSE).
 ## Trademarks
-This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

 The model is licensed under the [MIT license](./LICENSE).
 ## Trademarks
+This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
+## Appendix A: Korean benchmarks
+The prompt is the same as the [CLIcK paper](https://arxiv.org/abs/2403.06412) prompt. The experimental results below were given with max_tokens=512 (zero-shot), max_tokens=1024 (5-shot), temperature=0.01. No system prompt used.
+- GPT-4o: 2024-05-13 version
+- GPT-4o-mini: 2024-07-18 version
+- GPT-4-turbo: 2024-04-09 version
+- GPT-3.5-turbo: 2023-06-13 version
+| Benchmarks               |   Phi-3.5-MoE-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:-------------------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| CLIcK                    |                  56.44 |                           29.12 |                   47.82 |    80.46 |         68.5  |         72.82 |           50.98 |
+| HAERAE 1.0               |                  61.83 |                           36.41 |                   53.9  |    85.7  |         76.4  |         77.76 |           52.67 |
+| KMMLU (0-shot, CoT)      |                  47.43 |                           30.82 |                   38.54 |    64.26 |         52.63 |         58.75 |           40.3  |
+| KMMLU (5-shot)           |                  47.92 |                           29.98 |                   20.21 |    64.28 |         51.62 |         59.29 |           42.28 |
+| KMMLU-HARD (0-shot, CoT) |                  25.34 |                           25.68 |                   24.03 |    39.62 |         24.56 |         30.56 |           20.97 |
+| KMMLU-HARD (5-shot)      |                  25.66 |                           25.73 |                   15.81 |    40.94 |         24.63 |         31.12 |           21.19 |
+| Average                  |                  45.82 |                           29.99 |                   29.29 |    62.54 |         50.08 |         56.74 |           39.61 |
+#### CLIcK (Cultural and Linguistic Intelligence in Korean)
+##### Accuracy by supercategory
+| supercategory   |   Phi-3.5-MoE-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Culture         |                  58.44 |                           29.74 |                   51.15 |    81.89 |         70.95 |         73.61 |           53.38 |
+| Language        |                  52.31 |                           27.85 |                   40.92 |    77.54 |         63.54 |         71.23 |           46    |
+| **Overall**     |                  56.44 |                           29.12 |                   47.82 |    80.46 |         68.5  |         72.82 |           50.98 |
+##### Accuracy by category
+| supercategory   | category    |   Phi-3.5-MoE-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|:------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Culture         | Economy     |                  77.97 |                           28.81 |                   66.1  |    94.92 |         83.05 |         89.83 |           64.41 |
+| Culture         | Geography   |                  60.31 |                           29.01 |                   54.2  |    80.15 |         77.86 |         82.44 |           53.44 |
+| Culture         | History     |                  33.93 |                           30    |                   29.64 |    66.92 |         48.4  |         46.4  |           31.79 |
+| Culture         | Law         |                  52.51 |                           22.83 |                   44.29 |    70.78 |         57.53 |         61.19 |           41.55 |
+| Culture         | Politics    |                  70.24 |                           33.33 |                   59.52 |    88.1  |         83.33 |         89.29 |           65.48 |
+| Culture         | Pop Culture |                  80.49 |                           34.15 |                   60.98 |    97.56 |         85.37 |         92.68 |           75.61 |
+| Culture         | Society     |                  74.43 |                           31.72 |                   65.05 |    92.88 |         85.44 |         86.73 |           71.2  |
+| Culture         | Tradition   |                  58.11 |                           31.98 |                   54.95 |    87.39 |         74.77 |         79.28 |           55.86 |
+| Language        | Functional  |                  48    |                           24    |                   32.8  |    84.8  |         64.8  |         80    |           40    |
+| Language        | Grammar     |                  29.58 |                           23.33 |                   22.92 |    57.08 |         42.5  |         47.5  |           30    |
+| Language        | Textual     |                  73.33 |                           33.33 |                   59.65 |    91.58 |         80.7  |         87.37 |           62.11 |
+#### HAERAE 1.0
+| category              |   Phi-3.5-MoE-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| General Knowledge     |                  39.77 |                           28.41 |                   34.66 |    77.27 |         53.41 |         66.48 |           40.91 |
+| History               |                  60.64 |                           22.34 |                   44.15 |    92.02 |         84.57 |         78.72 |           30.32 |
+| Loan Words            |                  70.41 |                           35.5  |                   63.31 |    79.88 |         76.33 |         78.11 |           59.17 |
+| Rare Words            |                  63.95 |                           42.96 |                   63.21 |    87.9  |         81.98 |         79.01 |           61.23 |
+| Reading Comprehension |                  64.43 |                           41.16 |                   51.9  |    85.46 |         77.18 |         80.09 |           56.15 |
+| Standard Nomenclature |                  66.01 |                           32.68 |                   58.82 |    88.89 |         75.82 |         79.08 |           53.59 |
+| **Overall**           |                  61.83 |                           36.41 |                   53.9  |    85.7  |         76.4  |         77.76 |           52.67 |
+#### KMMLU (0-shot, CoT)
+| supercategory   |   Phi-3.5-MoE-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Applied Science |                  45.15 |                           31.68 |                   37.03 |    61.52 |         49.29 |         55.98 |           38.47 |
+| HUMSS           |                  49.75 |                           26.47 |                   37.29 |    69.45 |         56.59 |         63    |           40.9  |
+| Other           |                  47.24 |                           31.01 |                   39.15 |    63.79 |         52.35 |         57.53 |           40.19 |
+| STEM            |                  49.08 |                           31.9  |                   40.42 |    65.16 |         54.74 |         60.84 |           42.24 |
+| **Overall**     |                  47.43 |                           30.82 |                   38.54 |    64.26 |         52.63 |         58.75 |           40.3  |
+#### KMMLU (5-shot)
+| supercategory   |   Phi-3.5-MoE-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Applied Science |                  45.9  |                           29.98 |                   19.24 |    61.47 |         48.66 |         56.85 |           40.22 |
+| HUMSS           |                  49.18 |                           27.27 |                   22.5  |    68.79 |         55.95 |         63.68 |           43.35 |
+| Other           |                  48.43 |                           30.76 |                   20.95 |    64.21 |         51.1  |         57.85 |           41.92 |
+| STEM            |                  49.21 |                           30.73 |                   19.55 |    65.28 |         53.29 |         61.08 |           44.43 |
+| **Overall**     |                  47.92 |                           29.98 |                   20.21 |    64.28 |         51.62 |         59.29 |           42.28 |
+#### KMMLU-HARD (0-shot, CoT)
+| supercategory   |   Phi-3.5-MoE-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024)|   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Applied Science |                  25.83 |                           26.17 |                   26.25 |    37.12 |         22.25 |         29.17 |           21.07 |
+| HUMSS           |                  21.52 |                           24.38 |                   20.21 |    41.97 |         23.31 |         31.51 |           19.44 |
+| Other           |                  24.82 |                           24.82 |                   23.88 |    40.39 |         26.48 |         29.59 |           22.22 |
+| STEM            |                  28.18 |                           26.91 |                   24.64 |    39.82 |         26.36 |         32.18 |           20.91 |
+| **Overall**     |                  25.34 |                           25.68 |                   24.03 |    39.62 |         24.56 |         30.56 |           20.97 |
+#### KMMLU-HARD (5-shot)
+| supercategory   |   Phi-3.5-MoE-Instruct |  Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Applied Science |                  21    |                           29    |                   12    |    31    |         21    |         25    |           20    |
+| HUMSS           |                  22.88 |                           19.92 |                   14    |    43.98 |         23.47 |         33.53 |           19.53 |
+| Other           |                  25.13 |                           27.27 |                   12.83 |    39.84 |         28.34 |         29.68 |           23.22 |
+| STEM            |                  21.75 |                           25.25 |                   12.75 |    40.25 |         23.25 |         27.25 |           19.75 |
+| **Overall**     |                  25.66 |                           25.73 |                   15.81 |    40.94 |         24.63 |         31.12 |           21.19 |