---
language:
- zh
tags:
- bert
- pytorch
- zh
- pycorrector
datasets:
- shibing624/CSC
- Weaxs/csc
license: apache-2.0
---
|
# Traditional Chinese Spelling Correction Model
|
* Trained with the [shibing624/pycorrector](https://github.com/shibing624/pycorrector) project
* Built on the [hfl/chinese-macbert-base](https://huggingface.co/hfl/chinese-macbert-base) base model
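A character-level spelling corrector like this one emits an output sequence the same length as its input, so corrections can be read off by diffing the two strings position by position. A minimal sketch of that post-processing step (the function name is illustrative, not part of pycorrector's API):

```python
def extract_corrections(source: str, corrected: str):
    """Compare the model input with its same-length output and return
    (position, wrong_char, right_char) tuples, one per corrected character."""
    assert len(source) == len(corrected), "CSC models keep the sequence length fixed"
    return [
        (i, s, c)
        for i, (s, c) in enumerate(zip(source, corrected))
        if s != c
    ]

# Hypothetical example: the model fixes 「作夜」 to 「昨夜」.
print(extract_corrections("作夜下了大雨", "昨夜下了大雨"))  # → [(0, '作', '昨')]
```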
|
|
|
## Training Data

* 270K SIGHAN samples from [shibing624/CSC](https://huggingface.co/datasets/shibing624/CSC)
* 270K NLG samples from [Weaxs/csc](https://huggingface.co/datasets/Weaxs/csc)
* Converted from Simplified to Traditional Chinese with [OpenCC](https://github.com/BYVoid/OpenCC) using the s2twp configuration
|
|
|
## Training Techniques

* Input sentence lengths follow a normal distribution, with 1–3 erroneous characters per sentence
* Focal Loss is introduced, treating typo detection like object detection
* The output loss mixes cross-entropy loss and Focal Loss at a 7:3 ratio
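The 7:3 loss mix above can be sketched in plain Python for a single token's gold-class probability; `gamma=2.0` is an assumed focusing parameter, and the actual training code in pycorrector may differ:

```python
import math

def cross_entropy(p: float) -> float:
    """Standard cross-entropy for probability p of the gold character."""
    return -math.log(p)

def focal_loss(p: float, gamma: float = 2.0) -> float:
    """Focal loss, FL(p) = -(1 - p)^gamma * log(p): the (1 - p)^gamma factor
    down-weights easy tokens so training focuses on the rare typo positions."""
    return -((1.0 - p) ** gamma) * math.log(p)

def combined_loss(p: float) -> float:
    """Mix cross-entropy and focal loss at the 7:3 ratio described above."""
    return 0.7 * cross_entropy(p) + 0.3 * focal_loss(p)

# An easy token (p=0.99) contributes almost no focal loss; a hard one (p=0.1) dominates.
print(focal_loss(0.99), focal_loss(0.1))
```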
|
|
|
## SIGHAN Evaluation Scores

| Model | Accuracy | Precision | Recall | F1 |
| :--------------------------------- | -----: | -----: | -----: | -----: |
| chinese-macbert-base | 0.88 | 0.09 | 0.31 | 0.14 |
| macbert4csc-base-chinese (output converted to Traditional) | 0.99 | 0.79 | 0.95 | 0.86 |
| macbert4csc-traditional-chinese | 1.00 | 0.90 | 0.99 | 0.94 |
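The scores in the tables can be computed at the sentence level: a prediction counts as a positive when the model changes the input, and as correct when it matches the reference. A rough sketch of this scoring (the exact counting convention is an assumption; pycorrector's evaluation script may differ in details):

```python
def csc_metrics(examples):
    """examples: list of (source, reference, prediction) sentence triples.
    Sentence-level scoring: a prediction is a positive when it changes the
    source; a positive is true when the prediction equals the reference."""
    tp = fp = fn = correct = 0
    for src, ref, hyp in examples:
        if hyp == ref:
            correct += 1
        if hyp != src:           # the model made a correction
            if hyp == ref:
                tp += 1
            else:
                fp += 1
        elif ref != src:         # an error was present but the model missed it
            fn += 1
    accuracy = correct / len(examples)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy set: one good fix, one untouched correct sentence, one spurious edit.
examples = [
    ("作夜下雨", "昨夜下雨", "昨夜下雨"),
    ("今天很好", "今天很好", "今天很好"),
    ("好嗎", "好嗎", "好吗"),
]
print(csc_metrics(examples))
```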
|
|
|
## NLG Evaluation Scores

| Model | Accuracy | Precision | Recall | F1 |
| :--------------------------------- | -----: | -----: | -----: | -----: |
| chinese-macbert-base | 0.85 | 0.08 | 0.31 | 0.13 |
| macbert4csc-base-chinese (output converted to Traditional) | 0.98 | 0.70 | 0.95 | 0.81 |
| macbert4csc-traditional-chinese | 0.99 | 0.80 | 0.99 | 0.89 |
|
|
|
### Sincere thanks to the original author XuMing for open-sourcing this work
|
|