bert4sl_punct_zh_public
Project URL
https://github.com/yongzhuo/macro-correct
Time
2024.6
Training data (dataset)
Collected from high-quality corpora and then filtered, for example by perplexity (PPL):
- chinese-poetry/chinese-poetry
- chinese-poetry/huajianji
- garychowcmu/daizhigev20
- yangjianxin1/Firefly
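- Xuexi Qiangguo, 4.28 million records; China-hosted mirror: Macropodus/xuexiqiangguo_428w
- xi_talk, 400,000 records; China-hosted mirror: Papersnake/xi_talk
- [1 million good sentences generated by qwen-7b]
- [People's Daily corpus, 20 million]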
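The PPL filtering mentioned above is not specified further in this README. As a rough illustration only, the sketch below scores sentences with a toy character-level unigram model and drops those above a perplexity threshold. The function names (`train_char_unigram`, `perplexity`, `filter_by_ppl`), the unigram model, and the threshold are all assumptions for illustration; a real pipeline would more likely use a pretrained language model.

```python
import math
from collections import Counter


def train_char_unigram(corpus):
    """Build a character unigram model with add-one smoothing (toy stand-in for a real LM)."""
    counts = Counter(ch for sent in corpus for ch in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen characters

    def log_prob(ch):
        return math.log((counts.get(ch, 0) + 1) / (total + vocab))

    return log_prob


def perplexity(sentence, log_prob):
    """Perplexity = exp of the average negative log-probability per character."""
    if not sentence:
        return float("inf")
    nll = -sum(log_prob(ch) for ch in sentence) / len(sentence)
    return math.exp(nll)


def filter_by_ppl(sentences, log_prob, max_ppl):
    """Keep only sentences whose perplexity does not exceed max_ppl (hypothetical threshold)."""
    return [s for s in sentences if perplexity(s, log_prob) <= max_ppl]


reference = ["山不在高,有仙则名。", "水不在深,有龙则灵。"]
log_prob = train_char_unigram(reference)
candidates = ["山不在高,有仙则名。", "@@##$$%%"]
kept = filter_by_ppl(candidates, log_prob, max_ppl=20.0)
print(kept)
```

Sentences made of characters the reference model has never seen get a high perplexity and are dropped, which is the general idea behind PPL-based quality filtering.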
Training notes
At most 100,000 sentences per punctuation mark, 5 million training sentences in total, trained for 3 epochs.
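The README does not say how the per-punctuation cap was applied. One possible reading, sketched below under that assumption, is to keep a sentence as long as at least one punctuation mark it contains is still under its quota. The `PUNCTS` set and the `cap_per_punct` helper are hypothetical names introduced here, not part of macro-correct.

```python
from collections import Counter

# Hypothetical set of punctuation marks to balance; the actual set is not given in the README.
PUNCTS = ",。?!、;:,?!;:"


def cap_per_punct(sentences, max_per_punct, puncts=PUNCTS):
    """Keep a sentence if any punctuation mark it contains is still under max_per_punct."""
    counts = Counter()
    kept = []
    for sent in sentences:
        marks = {ch for ch in sent if ch in puncts}
        if any(counts[m] < max_per_punct for m in marks):
            kept.append(sent)
            for m in marks:
                counts[m] += 1  # count the sentence once for every mark it contains
    return kept


sample = ["山不在高,有仙则名。", "水不在深,有龙则灵。", "斯是陋室,惟吾德馨。", "何陋之有?"]
balanced = cap_per_punct(sample, max_per_punct=2)
print(balanced)
```

With a cap of 2, the third sentence is dropped because both of its marks have already reached the quota, while the question keeps its place because `?` is still rare.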
Usage: punctuation correction
import os
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct_punct
### 1. Default punctuation correction (list input)
text_list = ["山不在高有仙则名。",
             "水不在深,有龙则灵",
             "斯是陋室惟吾德馨",
             "苔痕上阶绿草,色入帘青。"
             ]
text_csc = correct_punct(text_list)
print("Default punctuation correction (list input):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)
"""
Default punctuation correction (list input):
{'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高,有仙则名。', 'score': 0.9917, 'errors': [['', ',', 4, 0.9917]]}
{'index': 1, 'source': '水不在深,有龙则灵', 'target': '水不在深,有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
{'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室,惟吾德馨。', 'score': 0.9999, 'errors': [['', ',', 4, 0.9999], ['', '。', 8, 0.9998]]}
{'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿,草色入帘青。', 'score': 0.9998, 'errors': [['', ',', 5, 0.9998]]}
"""
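Each result above is a dict with `source`, `target`, `score`, and `errors` keys. A small post-processing sketch, assuming `score` is a confidence in [0, 1] where higher means more certain, is to accept a correction only above a threshold. The `select_confident` helper and the 0.995 threshold are hypothetical; the sample dicts are copied from the output shown above.

```python
def select_confident(results, min_score=0.995):
    """Return the corrected target when score >= min_score, otherwise the unchanged source."""
    return [r["target"] if r["score"] >= min_score else r["source"] for r in results]


# Two results copied from the example output above.
results = [
    {'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高,有仙则名。',
     'score': 0.9917, 'errors': [['', ',', 4, 0.9917]]},
    {'index': 1, 'source': '水不在深,有龙则灵', 'target': '水不在深,有龙则灵。',
     'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]},
]

for line in select_confident(results):
    print(line)
```

Raising the threshold trades recall for precision: fewer corrections are applied, but those that remain are the ones the model is most sure about.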