Commit 8dcb424 by Macropodus (verified) · 1 parent: cbbf1e1

Update README.md

Files changed (1): README.md +51 -18
README.md CHANGED
@@ -1,18 +1,51 @@
- # bert4sl_punct_zh_public
- ## Time
- 2024.6
-
- ## Training data (dataset)
- Built from high-quality corpora that were collected and then filtered, for example by perplexity (PPL):
- - [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- - [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- - [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
- - [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
- - [Xuexi Qiangguo data (4.28M)](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w); mainland China mirror: [Macropodus/xuexiqiangguo_428w](https://hf-mirror.com/datasets/Macropodus/xuexiqiangguo_428w)
- - [xi_talk (400K)](https://huggingface.co/datasets/Papersnake/xi_talk); mainland China mirror: [Papersnake/xi_talk](https://hf-mirror.com/datasets/Papersnake/xi_talk)
- - [1M good sentences generated by qwen-7b]
- - [People's Daily corpus (20M)]
-
- ## Training notes
- At most 100K sentences per punctuation mark, 5M training sentences in total, trained for 3 epochs.
-
+ # bert4sl_punct_zh_public
+ ## Project repository
+ [https://github.com/yongzhuo/macro-correct](https://github.com/yongzhuo/macro-correct)
+
+
+
+ ## Time
+ 2024.6
+
+ ## Training data (dataset)
+ Built from high-quality corpora that were collected and then filtered, for example by perplexity (PPL):
+ - [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
+ - [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
+ - [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
+ - [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
+ - [Xuexi Qiangguo data (4.28M)](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w); mainland China mirror: [Macropodus/xuexiqiangguo_428w](https://hf-mirror.com/datasets/Macropodus/xuexiqiangguo_428w)
+ - [xi_talk (400K)](https://huggingface.co/datasets/Papersnake/xi_talk); mainland China mirror: [Papersnake/xi_talk](https://hf-mirror.com/datasets/Papersnake/xi_talk)
+ - [1M good sentences generated by qwen-7b]
+ - [People's Daily corpus (20M)]
+
+ ## Training notes
+ At most 100K sentences per punctuation mark, 5M training sentences in total, trained for 3 epochs.
+
+
+ ## Usage: punctuation correction
+ ```python
+ import os
+ os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"  # set before importing macro_correct
+ from macro_correct import correct_punct
+
+
+ ### 1. Default punctuation correction (list input)
+ text_list = ["山不在高有仙则名。",
+              "水不在深,有龙则灵",
+              "斯是陋室惟吾德馨",
+              "苔痕上阶绿草,色入帘青。"
+              ]
+ text_csc = correct_punct(text_list)
+ print("Default punctuation correction (list input):")
+ for res_i in text_csc:
+     print(res_i)
+ print("#" * 128)
+ """
+ Default punctuation correction (list input):
+ {'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高,有仙则名。', 'score': 0.9917, 'errors': [['', ',', 4, 0.9917]]}
+ {'index': 1, 'source': '水不在深,有龙则灵', 'target': '水不在深,有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
+ {'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室,惟吾德馨。', 'score': 0.9999, 'errors': [['', ',', 4, 0.9999], ['', '。', 8, 0.9998]]}
+ {'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿,草色入帘青。', 'score': 0.9998, 'errors': [['', ',', 5, 0.9998]]}
+ """
+ ```
+
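The result dictionaries in the usage example carry `source`, `target`, `score`, and `errors` fields. As a minimal sketch of one way to post-process them, the snippet below keeps a correction only when its `score` reaches a threshold. The helper name `select_confident`, the threshold value, and the reading of `score` as a sentence-level confidence are assumptions for illustration and are not part of macro-correct; the sample data is copied from the output shown in the README diff, so the snippet runs without the library.

```python
from typing import Dict, List

# Two result entries copied from the README output in the diff above.
results = [
    {"index": 0, "source": "山不在高有仙则名。", "target": "山不在高,有仙则名。",
     "score": 0.9917, "errors": [["", ",", 4, 0.9917]]},
    {"index": 1, "source": "水不在深,有龙则灵", "target": "水不在深,有龙则灵。",
     "score": 0.9995, "errors": [["", "。", 9, 0.9995]]},
]


def select_confident(results: List[Dict], threshold: float = 0.995) -> List[str]:
    """Hypothetical helper: accept a correction only if its score meets the threshold.

    Assumes `score` behaves like a confidence value, which the README does not state.
    """
    texts = []
    for item in results:
        # Fall back to the original text when nothing was changed or the score is low.
        accept = bool(item["errors"]) and item["score"] >= threshold
        texts.append(item["target"] if accept else item["source"])
    return texts


accepted = select_confident(results, threshold=0.995)
for text in accepted:
    print(text)
```

With this threshold the first sentence stays unchanged (0.9917 is below 0.995) and the second takes its corrected form. A suitable cutoff would need to be tuned on real data.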