Macropodus committed: Update README.md

# bert4sl_punct_zh_public
## Project repository
[https://github.com/yongzhuo/macro-correct](https://github.com/yongzhuo/macro-correct)

## Time
June 2024

## Training data (dataset)
The training set was built by collecting high-quality corpora and filtering them, for example by perplexity (PPL):
- [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
- [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
- [Xuexi Qiangguo, 4.28 million records](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w); mirror for mainland China: [Macropodus/xuexiqiangguo_428w](https://hf-mirror.com/datasets/Macropodus/xuexiqiangguo_428w)
- [xi_talk (400,000)](https://huggingface.co/datasets/Papersnake/xi_talk); mirror for mainland China: [Papersnake/xi_talk](https://hf-mirror.com/datasets/Papersnake/xi_talk)
- [1 million well-formed sentences generated by qwen-7b]
- [People's Daily corpus (20 million)]

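To illustrate the PPL filtering step: the README does not say which language model or threshold was used, so the sketch below substitutes a toy character-bigram model with add-one smoothing. The function names, the reference corpus, and the `max_ppl` value are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_char_bigram(corpus):
    """Count character bigrams over a reference corpus (toy stand-in for a real LM)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        chars = ["<s>"] + list(sent)
        contexts.update(chars[:-1])            # every char that precedes another
        bigrams.update(zip(chars, chars[1:]))
        vocab.update(sent)
    # +1 reserves probability mass for characters never seen in training.
    return contexts, bigrams, len(vocab) + 1


def perplexity(sent, contexts, bigrams, vocab_size):
    """Per-character perplexity under the add-one-smoothed bigram model."""
    chars = ["<s>"] + list(sent)
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(sent), 1))


def filter_by_ppl(candidates, contexts, bigrams, vocab_size, max_ppl):
    """Keep only candidates whose perplexity does not exceed max_ppl."""
    return [s for s in candidates
            if perplexity(s, contexts, bigrams, vocab_size) <= max_ppl]


reference = ["山不在高,有仙则名。", "水不在深,有龙则灵。", "斯是陋室,惟吾德馨。"]
model = train_char_bigram(reference)
for text in ["山不在高,有仙则名。", "@#$%&"]:
    print(text, round(perplexity(text, *model), 2))
```

In practice the scoring function would presumably be a pretrained language model rather than a bigram counter; only the keep-if-below-threshold logic carries over.
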
## Training notes
Each punctuation mark contributes at most 100,000 sentences, giving 5 million training sentences in total; training ran for 3 epochs.

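One possible reading of the per-mark cap is sketched below. The README does not describe the actual sampling procedure, so the label set `PUNCT_MARKS`, the function `cap_per_punct`, and the rule "keep a sentence only while every mark it contains is still under the cap" are assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical label set; the marks the model was actually trained on are not listed in the README.
PUNCT_MARKS = set(",。?!、;:")


def cap_per_punct(sentences, max_per_mark, seed=0):
    """Sample sentences so that no punctuation mark is counted more than max_per_mark times."""
    rng = random.Random(seed)
    pool = list(sentences)
    rng.shuffle(pool)                      # avoid favouring the head of the corpus
    counts = Counter()
    kept = []
    for sent in pool:
        marks = {ch for ch in sent if ch in PUNCT_MARKS}
        # Skip sentences with no known mark, or with any mark already at its cap.
        if marks and all(counts[m] < max_per_mark for m in marks):
            kept.append(sent)
            counts.update(marks)           # each mark counts once per sentence
    return kept, counts


sentences = ["山不在高,有仙则名。", "水不在深,有龙则灵。", "斯是陋室,惟吾德馨。", "何陋之有?"]
kept, counts = cap_per_punct(sentences, max_per_mark=2)
print(len(kept), dict(counts))
```

A strict cap like this can starve rare marks that mostly co-occur with common ones; a real pipeline might instead bucket each sentence under its rarest mark.
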
## Usage: punctuation correction
```python
import os
# Set the flag before importing macro_correct, as in the original example.
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct_punct


### 1. Default punctuation correction (list input)
text_list = ["山不在高有仙则名。",
             "水不在深,有龙则灵",
             "斯是陋室惟吾德馨",
             "苔痕上阶绿草,色入帘青。"
             ]
text_csc = correct_punct(text_list)
print("Default punctuation correction (list input):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)
"""
Default punctuation correction (list input):
{'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高,有仙则名。', 'score': 0.9917, 'errors': [['', ',', 4, 0.9917]]}
{'index': 1, 'source': '水不在深,有龙则灵', 'target': '水不在深,有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
{'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室,惟吾德馨。', 'score': 0.9999, 'errors': [['', ',', 4, 0.9999], ['', '。', 8, 0.9998]]}
{'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿,草色入帘青。', 'score': 0.9998, 'errors': [['', ',', 5, 0.9998]]}
"""
```

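The `errors` field in the sample output appears to hold entries of the form `[original, corrected, position, score]`, with `position` indexing into `source`. That layout is inferred from the printed results, not from documentation, so the helper below (`apply_errors`, a hypothetical name) is only a sketch of how such entries could be replayed onto the source string.

```python
def apply_errors(source, errors):
    """Replay [original, corrected, position, score] entries onto source.

    Assumes positions refer to the unmodified source string, as the sample output suggests.
    """
    out = source
    # Work right to left so that earlier positions stay valid after each edit.
    for original, corrected, pos, _score in sorted(errors, key=lambda e: e[2], reverse=True):
        out = out[:pos] + corrected + out[pos + len(original):]
    return out


# Results copied from the sample output above (indices 0 to 2).
results = [
    {"source": "山不在高有仙则名。", "target": "山不在高,有仙则名。",
     "errors": [["", ",", 4, 0.9917]]},
    {"source": "水不在深,有龙则灵", "target": "水不在深,有龙则灵。",
     "errors": [["", "。", 9, 0.9995]]},
    {"source": "斯是陋室惟吾德馨", "target": "斯是陋室,惟吾德馨。",
     "errors": [["", ",", 4, 0.9999], ["", "。", 8, 0.9998]]},
]
for res in results:
    print(apply_errors(res["source"], res["errors"]) == res["target"])
```

In the fourth sample result the comma after 草 is dropped in `target`, but no matching entry appears in `errors`, so replaying `errors` does not always reproduce `target`; reading `target` directly is the safer way to get the corrected text.
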