nxphi47 committed
Commit 7752b68 · Parent(s): c1861bb

Update README.md

Files changed (1): README.md (+14 −4)
README.md CHANGED
@@ -158,16 +158,26 @@ In English, our model is 46% as good as Llama-2-13b-chat, even though it did not
 
 Compared with ChatGPT-3.5, our SeaLLM-13b model is performing 45% as good as ChatGPT for Thai.
 For important aspects such as Safety and Task-Solving, our model is nearly on par with ChatGPT across the languages.
-Note that **GPT-4**, as built for global use, may not consider certain safety-related responses from ChatGPT as harmful or sensitive in the local context.
-Using GPT-4 to evaluate ChatGPT-3.5 can also be tricky not only for safety aspects because they likely follow a similar training strategy with similar data.
-Meanwhile, most of the safety-related questions and expected responses in this test set are globally acceptable,
-whereas we leave those with conflicting and controversial opinions, as well as more comprehensive human evaluation for future update.
+Note that using **GPT-4** to evaluate ChatGPT-3.5 can be tricky, and not only for safety aspects, because the two models likely follow a similar training strategy on similar data.
 
 <div class="row" style="display: flex; clear: both;">
 <img src="seallm_vs_chatgpt_by_lang.png" alt="Snow" style="float: left; width: 49.5%">
 <img src="seallm_vs_chatgpt_by_cat_sea.png" alt="Forest" style="float: left; width: 49.5%">
 </div>
 
+**GPT-4**, which was built for global use, may not consider certain safety-related responses as harmful or sensitive in the local context, and certain sensitive topics may entail conflicting and controversial opinions across cultures.
+We therefore engage native linguists to rate and compare SeaLLM's and ChatGPT's responses on a natural, locally aware safety test set.
+The linguists choose a winner or declare a tie in a fully randomized, double-blind manner: neither we nor the linguists know which model produced which response.
+
+As shown in the human evaluation below, SeaLLM ties with ChatGPT in most cases, while outperforming ChatGPT for Vi and Th.
+
+| Safety Human Eval | Id | Th | Vi | Avg |
+|-------------------|--------|--------|--------|--------|
+| SeaLLM-13b Win | 12.09% | 23.40% | 8.42% | 14.64% |
+| Tie | 65.93% | 67.02% | 89.47% | 74.29% |
+| ChatGPT Win | 21.98% | 9.57% | 2.11% | 11.07% |
+
 ### M3Exam - World Knowledge in Regional Languages
 
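The double-blind rating protocol this commit describes (raters see each response pair in a randomized A/B order and pick a winner or a tie, with the mapping back to systems kept sealed until scoring) can be sketched as below. This is an illustrative sketch only, not the authors' code; the function names and data layout are assumptions.

```python
import random

# Illustrative sketch (assumed layout): each item is
# (prompt, seallm_response, chatgpt_response) for one safety question.
def blind_pairs(items, seed=0):
    """Shuffle the A/B order of each response pair so raters cannot
    tell which system produced which answer."""
    rng = random.Random(seed)
    blinded, key = [], []
    for prompt, sea_resp, gpt_resp in items:
        pair = [("seallm", sea_resp), ("chatgpt", gpt_resp)]
        rng.shuffle(pair)  # randomize which system appears as "A"
        blinded.append((prompt, pair[0][1], pair[1][1]))  # shown to raters
        key.append((pair[0][0], pair[1][0]))  # kept sealed until scoring
    return blinded, key

def tally(key, verdicts):
    """Map raters' 'A'/'B'/'tie' verdicts back to systems and count them."""
    counts = {"seallm": 0, "chatgpt": 0, "tie": 0}
    for (a_sys, b_sys), verdict in zip(key, verdicts):
        counts[{"A": a_sys, "B": b_sys}.get(verdict, "tie")] += 1
    return counts
```

Because the key is only consulted at tally time, neither the raters nor anyone reading the blinded pairs can attribute a response to a system, which is the property the README's "double-blind" claim relies on.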