Chuboy/macbert4csc-traditional-chinese


---
language:
- zh
tags:
- bert
- pytorch
- zh
- pycorrector
datasets:
- shibing624/CSC
- Weaxs/csc
license: apache-2.0
---
# Traditional Chinese Spelling Correction Model
* Trained with the code from the [shibing624/pycorrector](https://github.com/shibing624/pycorrector) project
* Built on [hfl/chinese-macbert-base](https://huggingface.co/hfl/chinese-macbert-base) as the base model
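
The card does not include a usage snippet, so the following is a minimal sketch rather than an official one. It assumes the checkpoint loads as a `BertForMaskedLM` through `transformers`, as other pycorrector MacBERT checkpoints do, and that the repository ID is `Chuboy/macbert4csc-traditional-chinese`. The helpers `is_cjk`, `merge_prediction`, and `correct` are hypothetical names introduced here for illustration.

```python
def is_cjk(token):
    """True for a single character in the basic CJK Unified Ideographs block."""
    return len(token) == 1 and "\u4e00" <= token <= "\u9fff"


def merge_prediction(original, predicted):
    """Align predictions with the input, position by position.

    `predicted` is a string or a list of tokens. A prediction is accepted
    only when both the source character and the predicted token are single
    CJK characters; otherwise the source character is kept unchanged.
    Returns (corrected_text, [(wrong_char, right_char, index), ...]).
    """
    corrected, errors = [], []
    for i, src_char in enumerate(original):
        pred = predicted[i] if i < len(predicted) else src_char
        if not (is_cjk(src_char) and is_cjk(pred)):
            corrected.append(src_char)
            continue
        corrected.append(pred)
        if pred != src_char:
            errors.append((src_char, pred, i))
    return "".join(corrected), errors


def correct(text, model_name="Chuboy/macbert4csc-traditional-chinese"):
    """Run the model on one sentence (assumed repository ID and model class)."""
    # Imported lazily so the helpers above work without torch installed.
    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForMaskedLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Drop the [CLS] and [SEP] positions, then map ids back to tokens.
    token_ids = logits.argmax(dim=-1)[0, 1:-1].tolist()
    predicted = tokenizer.convert_ids_to_tokens(token_ids)
    return merge_prediction(text, predicted)
```

Calling `correct("今天天器很好")` would download the weights and return the corrected sentence with a list of edits. The position-by-position alignment assumes one token per character, which holds for Chinese characters but not for Latin words or long digit runs, so mixed-script input may need extra handling.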

## Training Data
* 270,000 SIGHAN samples from [shibing624/CSC](https://huggingface.co/datasets/shibing624/CSC)
* 270,000 NLG samples from [Weaxs/csc](https://huggingface.co/datasets/Weaxs/csc)
* Converted from Simplified to Traditional Chinese using [OpenCC](https://github.com/BYVoid/OpenCC) with the `s2twp` configuration
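
As a rough illustration of the conversion step, the sketch below converts one sample and recomputes the error positions afterwards. The field names `original_text`, `correct_text`, and `wrong_ids` are assumed from the shibing624/CSC format, and the realignment and filtering logic is an assumption: the card does not say how labels were handled after conversion. `make_s2twp_converter` and `convert_sample` are hypothetical helpers.

```python
def make_s2twp_converter():
    """Return a str -> str function backed by OpenCC's s2twp configuration."""
    # Imported lazily; requires an OpenCC Python binding to be installed.
    # Some bindings expect the config name as "s2twp.json" instead.
    import opencc

    return opencc.OpenCC("s2twp").convert


def convert_sample(sample, convert):
    """Convert one sample and recompute which positions are wrong.

    Returns None when the converted pair can no longer be aligned
    character by character, or when conversion erased every error.
    """
    src = convert(sample["original_text"])
    tgt = convert(sample["correct_text"])
    # Phrase-level conversion can change length on one side only.
    if len(src) != len(tgt):
        return None
    wrong_ids = [i for i, (a, b) in enumerate(zip(src, tgt)) if a != b]
    if not wrong_ids:
        return None
    return {"original_text": src, "correct_text": tgt, "wrong_ids": wrong_ids}
```

Passing the converter in as an argument keeps `convert_sample` easy to test with a stand-in function. In practice it would be called as `convert_sample(sample, make_s2twp_converter())`.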

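## Training Techniques
* Input sentence lengths should follow a normal distribution, with each sentence containing 1 to 3 erroneous characters
* FocalLoss is introduced so that typo detection is treated like an object-detection problem
* The output EntropyLoss and the FocalLoss are weighted 7:3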
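
The loss weighting above can be sketched without any framework. This is one possible reading, not the training code: it assumes the focal loss applies to a per-character "is this a typo" probability, the entropy term is a cross-entropy on the probability assigned to the correct output character, and the focusing parameter is `gamma = 2.0`. None of these details are stated in the card.

```python
import math

EPS = 1e-12  # avoids log(0)


def focal_loss(p_error, is_error, gamma=2.0):
    """Binary focal loss for one character: -(1 - p_t)^gamma * log(p_t)."""
    p_t = p_error if is_error else 1.0 - p_error
    p_t = min(max(p_t, EPS), 1.0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(p_correct_char):
    """Cross-entropy for one character given the probability of the right one."""
    return -math.log(max(p_correct_char, EPS))


def combined_loss(det_probs, det_labels, cor_probs,
                  w_entropy=0.7, w_focal=0.3, gamma=2.0):
    """Weighted sum of mean correction cross-entropy and mean detection focal loss."""
    det = sum(focal_loss(p, y, gamma) for p, y in zip(det_probs, det_labels))
    det /= len(det_probs)
    cor = sum(cross_entropy(p) for p in cor_probs) / len(cor_probs)
    return w_entropy * cor + w_focal * det
```

With `gamma = 0` the focal term reduces to ordinary binary cross-entropy. Larger values shrink the contribution of characters the detector already classifies confidently, which matters here because most characters in a sentence are correct.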

## SIGHAN Validation Scores
| Model | Accuracy | Precision | Recall | F1 Score |
| :--------------------------------- | -----: | -----: | -----: | -----: |
| chinese-macbert-base | 0.88 | 0.09 | 0.31 | 0.14 |
| macbert4csc-base-chinese (output converted from Simplified to Traditional) | 0.99 | 0.79 | 0.95 | 0.86 |
| macbert4csc-traditional-chinese | 1 | 0.9 | 0.99 | 0.94 |

## NLG Validation Scores
| Model | Accuracy | Precision | Recall | F1 Score |
| :--------------------------------- | -----: | -----: | -----: | -----: |
| chinese-macbert-base | 0.85 | 0.08 | 0.31 | 0.13 |
| macbert4csc-base-chinese (output converted from Simplified to Traditional) | 0.98 | 0.7 | 0.95 | 0.81 |
| macbert4csc-traditional-chinese | 0.99 | 0.8 | 0.99 | 0.89 |
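
The card does not say whether these scores are computed per sentence or per character. As a reference point only, the sketch below shows one common sentence-level convention for spelling correction, where a sentence counts as a hit only if the whole prediction matches the reference. The function name `sentence_level_scores` is hypothetical, and the actual evaluation script may differ.

```python
def sentence_level_scores(samples):
    """Compute accuracy, precision, recall and F1 over (source, prediction, reference) triples.

    A sample is "positive" when the source contains an error (source != reference).
    """
    tp = fp = tn = fn = 0
    for source, prediction, reference in samples:
        if source == reference:  # nothing to fix
            if prediction == reference:
                tn += 1
            else:
                fp += 1          # the model changed a correct sentence
        else:                    # the source has at least one typo
            if prediction == reference:
                tp += 1
            else:
                fn += 1          # missed or wrongly fixed
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Whatever the granularity, F1 is the harmonic mean of precision and recall, and the rows above are consistent with that once rounding is taken into account. For example, precision 0.9 and recall 0.99 give 2 × 0.9 × 0.99 / 1.89 ≈ 0.94.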

### Sincere thanks to the original author, XuMing, for open-sourcing this research