# bert4sl_punct_zh_public
## Project repository
[https://github.com/yongzhuo/macro-correct](https://github.com/yongzhuo/macro-correct)


## Date
2024.6

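## Training data (dataset)
The training set was built by collecting high-quality corpora and then filtering them, using perplexity (PPL) filtering among other methods. Sources: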
 - [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
 - [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
 - [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
 - [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
 - [Xuexi Qiangguo (学习强国), 4.28 million records](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w); mirror for mainland China: [Macropodus/xuexiqiangguo_428w](https://hf-mirror.com/datasets/Macropodus/xuexiqiangguo_428w)
 - [xi_talk, 400,000 records](https://huggingface.co/datasets/Papersnake/xi_talk); mirror for mainland China: [Papersnake/xi_talk](https://hf-mirror.com/datasets/Papersnake/xi_talk)
 - 1 million high-quality sentences generated by qwen-7b
 - People's Daily (人民日报) corpus, 20 million
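
The PPL filtering mentioned above is not specified in detail, so the following is only a minimal sketch of the general idea: perplexity is the exponential of the negative mean token log-probability, and sentences above a threshold are dropped. The function names, the `max_ppl` threshold, and the sample log-probabilities are all hypothetical; in practice the log-probabilities would come from a language model, which this README does not name.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean(log p)) for a list of natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def filter_by_ppl(scored_sentences, max_ppl):
    """Keep the text of every (text, log_probs) pair whose perplexity <= max_ppl."""
    return [
        text
        for text, log_probs in scored_sentences
        if perplexity(log_probs) <= max_ppl
    ]


# Hypothetical log-probabilities, for illustration only (not real model output).
samples = [
    ("fluent sentence", [-0.1, -0.2, -0.1]),
    ("noisy sentence", [-3.0, -4.0, -5.0]),
]
kept = filter_by_ppl(samples, max_ppl=10.0)
print(kept)
```

A lower threshold keeps only text the scoring model finds very predictable, so the cutoff would normally be tuned on a held-out sample rather than fixed in advance.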

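## Training notes
Each punctuation mark is capped at 100,000 sentences, for a total of 5 million training sentences. The model was trained for 3 epochs.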
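
The exact sampling procedure is not described here. As a rough illustration of a per-mark cap, the sketch below keeps a sentence only while every punctuation mark it contains is still under the limit. The function name `cap_per_mark`, the mark set, and the tiny cap are assumptions made for the example, not the project's actual code.

```python
from collections import Counter


def cap_per_mark(sentences, marks, max_per_mark):
    """Keep sentences in order while each mark they contain is under the cap.

    Sentences containing none of the marks are dropped in this sketch.
    """
    counts = Counter()
    kept = []
    for sentence in sentences:
        present = [m for m in marks if m in sentence]
        if not present:
            continue
        # Skip the sentence if any of its marks has already reached the cap.
        if any(counts[m] >= max_per_mark for m in present):
            continue
        kept.append(sentence)
        for m in present:
            counts[m] += 1
    return kept


# Toy data with a cap of 2; the training note above uses 100,000.
corpus = ["你好，世界。", "今天天气好。", "你来吗？", "好的。", "走吧，快点", "没有标点"]
balanced = cap_per_mark(corpus, marks="，。？", max_per_mark=2)
print(balanced)
```

Capping per mark keeps frequent marks such as the comma and full stop from dominating the training set relative to rarer ones.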


## Usage: punctuation correction
```python
import os
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct_punct


# 1. Default punctuation correction (list input)
text_list = ["山不在高有仙则名。",
             "水不在深，有龙则灵",
             "斯是陋室惟吾德馨",
             "苔痕上阶绿草,色入帘青。"
             ]
text_csc = correct_punct(text_list)
print("Default punctuation correction (list input):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)
"""
Default punctuation correction (list input):
{'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高，有仙则名。', 'score': 0.9917, 'errors': [['', '，', 4, 0.9917]]}
{'index': 1, 'source': '水不在深，有龙则灵', 'target': '水不在深，有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
{'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室，惟吾德馨。', 'score': 0.9999, 'errors': [['', '，', 4, 0.9999], ['', '。', 8, 0.9998]]}
{'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿，草色入帘青。', 'score': 0.9998, 'errors': [['', '，', 5, 0.9998]]}
"""
```
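
Each result shown above is a dict with `source`, `target`, `score`, and `errors` keys. The sketch below is one possible way to post-process such results: it accepts a correction only when `score` reaches a threshold and otherwise falls back to the original text. The helper `accept_corrections` and the `min_score` value are hypothetical, and treating `score` as an overall confidence is an assumption based only on the sample output.

```python
def accept_corrections(results, min_score=0.9):
    """Return the corrected text when score >= min_score, else the source text."""
    texts = []
    for item in results:
        if item["errors"] and item["score"] >= min_score:
            texts.append(item["target"])
        else:
            texts.append(item["source"])
    return texts


# Two entries copied from the sample output above.
results = [
    {"index": 0, "source": "山不在高有仙则名。", "target": "山不在高，有仙则名。",
     "score": 0.9917, "errors": [["", "，", 4, 0.9917]]},
    {"index": 1, "source": "水不在深，有龙则灵", "target": "水不在深，有龙则灵。",
     "score": 0.9995, "errors": [["", "。", 9, 0.9995]]},
]
final_texts = accept_corrections(results, min_score=0.995)
print(final_texts)
```

With the deliberately strict threshold of 0.995, the first sentence is left unchanged and the second receives its full stop.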