# bert4sl_punct_zh_public

## Project URL
[https://github.com/yongzhuo/macro-correct](https://github.com/yongzhuo/macro-correct)

## Time
2024.6

## Training data (dataset)
The training set was built from collected high-quality corpora, which were then filtered, for example by perplexity (PPL):
- [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
- [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
- [Xuexi Qiangguo, 4.28 million records](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w); mainland China mirror: [Macropodus/xuexiqiangguo_428w](https://hf-mirror.com/datasets/Macropodus/xuexiqiangguo_428w)
- [xi_talk, 400,000 records](https://huggingface.co/datasets/Papersnake/xi_talk); mainland China mirror: [Papersnake/xi_talk](https://hf-mirror.com/datasets/Papersnake/xi_talk)
- 1 million good sentences generated by qwen-7b
- People's Daily corpus (20 million)

## Training notes
At most 100,000 sentences were used for each punctuation mark, for a total of 5 million training sentences; the model was trained for 3 epochs.

## Usage: punctuation correction
```python
import os

# Environment flag set before macro_correct is imported
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct_punct


### 1. Default punctuation correction (list input)
text_list = [
    "山不在高有仙则名。",
    "水不在深,有龙则灵",
    "斯是陋室惟吾德馨",
    "苔痕上阶绿草,色入帘青。",
]
text_csc = correct_punct(text_list)
print("Default punctuation correction (list input):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)

"""
Default punctuation correction (list input):
{'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高,有仙则名。', 'score': 0.9917, 'errors': [['', ',', 4, 0.9917]]}
{'index': 1, 'source': '水不在深,有龙则灵', 'target': '水不在深,有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
{'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室,惟吾德馨。', 'score': 0.9999, 'errors': [['', ',', 4, 0.9999], ['', '。', 8, 0.9998]]}
{'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿,草色入帘青。', 'score': 0.9998, 'errors': [['', ',', 5, 0.9998]]}
"""
```
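
The result dictionaries printed above can be post-processed with plain Python. The sketch below is a minimal, hypothetical helper (`confident_edits` is not part of macro_correct) that keeps only the edits whose score reaches a threshold. It assumes, based solely on the sample output, that each entry of `errors` has the layout `[original, corrected, position, score]`; two of the sample results are hardcoded so the snippet runs without the library installed.

```python
from typing import Any, Dict, List, Tuple

# Two results copied from the sample output above, hardcoded so that this
# sketch does not need macro_correct to run.
SAMPLE_RESULTS: List[Dict[str, Any]] = [
    {"index": 0, "source": "山不在高有仙则名。", "target": "山不在高,有仙则名。",
     "score": 0.9917, "errors": [["", ",", 4, 0.9917]]},
    {"index": 2, "source": "斯是陋室惟吾德馨", "target": "斯是陋室,惟吾德馨。",
     "score": 0.9999, "errors": [["", ",", 4, 0.9999], ["", "。", 8, 0.9998]]},
]


def confident_edits(
    results: List[Dict[str, Any]], threshold: float = 0.999
) -> List[Tuple[int, str, str, int, float]]:
    """Hypothetical helper: collect edits with score >= threshold.

    Assumes each `errors` entry is [original, corrected, position, score],
    which is inferred from the sample output, not from library documentation.
    """
    kept = []
    for res in results:
        for original, corrected, position, score in res.get("errors", []):
            if score >= threshold:
                kept.append((res["index"], original, corrected, position, score))
    return kept


if __name__ == "__main__":
    # Print each retained edit as (sentence index, original, corrected, position, score)
    for edit in confident_edits(SAMPLE_RESULTS):
        print(edit)
```

With a real run, `SAMPLE_RESULTS` would be replaced by the list returned from `correct_punct`; the threshold of 0.999 is an arbitrary illustration, not a recommended value.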