Commit 26a251b · Parent: 1494f15
Update README.md

README.md (CHANGED)
@@ -34,6 +34,8 @@ tags:
 
 *Image drawn by GPT-4 DALL·E 3* TL;DR: perhaps better than all existing models under 70B in most quantitative evaluations...
 
+# CausalLM 14B
+
 **llama.cpp GGUF models**
 GPT2Tokenizer fixed by [Kerfuffle](https://github.com/KerfuffleV2) in [https://github.com/ggerganov/llama.cpp/pull/3743](https://github.com/ggerganov/llama.cpp/pull/3743); the new models have now been re-uploaded.
 
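A minimal usage sketch for the re-uploaded GGUF builds with the `llama-cpp-python` bindings; the local file name, quantization, and ChatML-style prompt template below are assumptions rather than details stated in this card:

```python
# Minimal sketch: run a GGUF build of the model with llama-cpp-python.
# The model_path is a hypothetical local file name; point it at whichever
# quantization you actually downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./causallm-14b.Q5_K_M.gguf",  # hypothetical file name
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

# ChatML-style prompt; treating this as the intended template is an
# assumption, adjust if the card specifies otherwise.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhy is the sky blue?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

out = llm(prompt, max_tokens=256, stop=["<|im_end|>"])
print(out["choices"][0]["text"])
```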
@@ -93,9 +95,19 @@ Hard ACC:54.71
 | ------------ | -------- | -------------- | ------ | ----------- | ------- | ------- | --------- | ---------- |
 | causallm-14b | **88.26087** | 1.116333 | 705 | 89 | 11 | 805 | community | 1391 |
 
-
 Win rate **88.26%** on the [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) [view raw](https://github.com/tatsu-lab/alpaca_eval/blob/3a47dcd81c56f6a8e6a5711f2754013919fbe90a/results/causallm-14b/model_outputs.json)
 
+## Other languages
+We are currently unable to produce accurate benchmark templates for non-QA tasks (languages other than English and Chinese). However, we will be working on other-language versions of the QA-task challenge in the near future.
+### Japanese Benchmark
+| Task                   | Version | Metric | Value  |   | Stderr |
+|------------------------|--------:|--------|-------:|---|-------:|
+| jcommonsenseqa-1.1-0.6 |     1.1 | acc    | 0.8213 | ± | 0.0115 |
+
+*The jcommonsenseqa result is very close to [Japanese Stable LM Gamma 7B (83.47)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable), the current SOTA Japanese LM, even though our model was not trained on a particularly large amount of Japanese text. This seems to reflect cross-lingual transfer of metalinguistic ability.*
+
+# 中文说明 (Chinese)
+
 **llama.cpp GGUF models**
 GPT2Tokenizer support was fixed by [Kerfuffle](https://github.com/KerfuffleV2) in [https://github.com/ggerganov/llama.cpp/pull/3743](https://github.com/ggerganov/llama.cpp/pull/3743); the new models will be re-uploaded shortly.
 
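The headline win rate can be reproduced from the table row above; a short sketch of the arithmetic, assuming the AlpacaEval convention that a draw counts as half a win and that the numeric columns are wins, losses, draws, and total comparisons (inferred from the leaderboard layout, not stated in this card):

```python
# Sketch: reproduce the 88.26087 win rate from the AlpacaEval row above.
# Assumptions (not stated in this card): the columns 705 / 89 / 11 / 805
# are wins / losses / draws / total, and a draw counts as half a win.
wins, losses, draws, total = 705, 89, 11, 805

assert wins + losses + draws == total
win_rate = 100 * (wins + 0.5 * draws) / total
print(f"{win_rate:.5f}")  # 88.26087
```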
@@ -155,3 +167,12 @@ STEM准确率:66.71
 | causallm-14b | **88.26087** | 1.116333 | 705 | 89 | 11 | 805 | community | 1391 |
 
 Win rate **88.26%** on the [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) [view raw](https://github.com/tatsu-lab/alpaca_eval/blob/3a47dcd81c56f6a8e6a5711f2754013919fbe90a/results/causallm-14b/model_outputs.json)
+
+## Other languages
+We are currently unable to produce accurate benchmark templates for non-QA tasks (languages other than English and Chinese). However, we will be working on other-language versions of the QA-task challenge in the near future.
+### Japanese Benchmark
+| Task                   | Version | Metric | Value  |   | Stderr |
+|------------------------|--------:|--------|-------:|---|-------:|
+| jcommonsenseqa-1.1-0.6 |     1.1 | acc    | 0.8213 | ± | 0.0115 |
+
+*The jcommonsenseqa result is very close to [Japanese Stable LM Gamma 7B (83.47)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable), the current SOTA Japanese LM, even though our model was not trained on a particularly large amount of Japanese text. This seems to reflect cross-lingual transfer of metalinguistic ability.*
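As a quick consistency check on the Japanese benchmark row, the reported standard error matches a plain binomial standard error if roughly 1,119 examples were scored; the split size is an assumption, not something stated in this card:

```python
# Sketch: check that the reported Stderr (0.0115) for acc = 0.8213 is
# consistent with a binomial standard error sqrt(p * (1 - p) / n).
# n = 1119 (assumed JCommonsenseQA validation-split size) is not stated
# in this card.
import math

acc = 0.8213
n = 1119  # assumed number of evaluated examples

stderr = math.sqrt(acc * (1.0 - acc) / n)
print(f"{stderr:.4f}")  # ~0.0115
```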