# bert4sl_punct_zh_public
## Project repository
[https://github.com/yongzhuo/macro-correct](https://github.com/yongzhuo/macro-correct)
## Time
2024.6
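## Training data (dataset)
The training set was built by collecting high-quality corpora and then filtering them, for example by perplexity (PPL). Sources: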
- [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
- [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
- [Xuexi Qiangguo, 4.28 million records](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w); mirror for mainland China: [Macropodus/xuexiqiangguo_428w](https://hf-mirror.com/datasets/Macropodus/xuexiqiangguo_428w)
- [xi_talk, 400,000 records](https://huggingface.co/datasets/Papersnake/xi_talk); mirror for mainland China: [Papersnake/xi_talk](https://hf-mirror.com/datasets/Papersnake/xi_talk)
- 1 million good sentences generated by qwen-7b
- People's Daily corpus (20 million)
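
The PPL filtering mentioned above is not described in detail here. As a rough illustration only, the sketch below computes perplexity from per-token log-probabilities and keeps sentences under a threshold. The function names, the threshold, and the log-probability values are all hypothetical; in practice the log-probabilities would come from whichever language model is used for scoring.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_ppl(scored_sentences, max_ppl):
    """Keep sentences whose perplexity is at most max_ppl.

    scored_sentences: iterable of (sentence, token_logprobs) pairs.
    Sentences with no token scores are dropped.
    """
    return [
        sentence
        for sentence, logprobs in scored_sentences
        if logprobs and perplexity(logprobs) <= max_ppl
    ]


# Hypothetical scores, made up for illustration (not produced by any real model).
scored = [
    ("山不在高,有仙则名。", [-0.1, -0.2]),
    ("高山不仙名则在有", [-3.0, -4.0]),
]
kept_sentences = filter_by_ppl(scored, max_ppl=10.0)
print(kept_sentences)
```

A lower perplexity means the scoring model finds the sentence more fluent, so the threshold trades corpus size against quality.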
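## Training notes
Each punctuation mark is capped at 100,000 sentences, giving 5 million training sentences in total; the model was trained for 3 epochs.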
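
How sentences are assigned to a punctuation mark is not specified here. The following is a minimal sketch of one possible capping scheme, under the assumption that each sentence is counted toward the least-filled mark it contains. The mark set, the helper name `cap_per_punctuation`, and the tiny cap in the example are hypothetical.

```python
from collections import Counter

# Hypothetical set of punctuation marks; the real label set may differ.
PUNCTUATION_MARKS = ",。?!;:、"


def cap_per_punctuation(sentences, max_per_mark):
    """Keep at most max_per_mark sentences per punctuation mark.

    Each sentence is assigned to exactly one mark: the mark it contains
    that currently has the fewest kept sentences (ties broken by code point).
    Sentences without any listed mark are skipped.
    """
    counts = Counter()
    kept = []
    for sentence in sentences:
        marks = {ch for ch in sentence if ch in PUNCTUATION_MARKS}
        if not marks:
            continue
        bucket = min(marks, key=lambda mark: (counts[mark], mark))
        if counts[bucket] < max_per_mark:
            counts[bucket] += 1
            kept.append(sentence)
    return kept


sample = ["你好,世界。", "今天好吗?", "好的,谢谢。", "真的吗?", "走吧,快点。"]
capped = cap_per_punctuation(sample, max_per_mark=1)
print(capped)
```

With the real cap of 100,000 the same loop would be called with `max_per_mark=100_000`.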
## Usage: punctuation correction
```python
import os
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct_punct
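### 1. Default punctuation correction (list input)
text_list = ["山不在高有仙则名。",
             "水不在深,有龙则灵",
             "斯是陋室惟吾德馨",
             "苔痕上阶绿草,色入帘青。"
             ]
text_csc = correct_punct(text_list)
print("Default punctuation correction (list input):")
# Each result is a dict with index, source, target, score, and errors.
for res_i in text_csc:
    print(res_i)
print("#" * 128)
"""
Default punctuation correction (list input):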
{'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高,有仙则名。', 'score': 0.9917, 'errors': [['', ',', 4, 0.9917]]}
{'index': 1, 'source': '水不在深,有龙则灵', 'target': '水不在深,有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
{'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室,惟吾德馨。', 'score': 0.9999, 'errors': [['', ',', 4, 0.9999], ['', '。', 8, 0.9998]]}
{'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿,草色入帘青。', 'score': 0.9998, 'errors': [['', ',', 5, 0.9998]]}
"""
```
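
The result dicts shown above carry a `score` alongside `source` and `target`, so a caller can choose to accept only confident corrections. The sketch below assumes that output format; the helper name `accept_corrections` and the 0.995 threshold are hypothetical and not part of `macro_correct`.

```python
def accept_corrections(results, min_score=0.995):
    """Return the corrected text when score >= min_score, else the original text."""
    accepted = []
    for item in results:
        if item["errors"] and item["score"] >= min_score:
            accepted.append(item["target"])
        else:
            accepted.append(item["source"])
    return accepted


# Two results copied from the sample output above.
sample_results = [
    {"index": 0, "source": "山不在高有仙则名。", "target": "山不在高,有仙则名。",
     "score": 0.9917, "errors": [["", ",", 4, 0.9917]]},
    {"index": 1, "source": "水不在深,有龙则灵", "target": "水不在深,有龙则灵。",
     "score": 0.9995, "errors": [["", "。", 9, 0.9995]]},
]
accepted_texts = accept_corrections(sample_results, min_score=0.995)
print(accepted_texts)
```

With this threshold the first sentence is left unchanged (score 0.9917) and the second takes its corrected form (score 0.9995).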