Macropodus/bert4sl_punct_zh_public

PyTorch

Macropodus committed on Jan 21

Commit 8dcb424 (verified) · 1 parent: cbbf1e1

Update README.md

README.md +51 -18

@@ -1,18 +1,51 @@
-# bert4sl_punct_zh_public
-## Time
-2024.6
-## Training data composition (dataset)
-Built from collected high-quality corpora, filtered by perplexity (PPL) and other methods:
- - [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- - [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- - [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
- - [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
- - [Xuexi Qiangguo, 4.28 million records](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w); mirror for mainland China: [Macropodus/xuexiqiangguo_428w](https://hf-mirror.com/datasets/Macropodus/xuexiqiangguo_428w)
- - [xi_talk, 400,000 records](https://huggingface.co/datasets/Papersnake/xi_talk); mirror for mainland China: [Papersnake/xi_talk](https://hf-mirror.com/datasets/Papersnake/xi_talk)
- - [1 million good sentences generated by qwen-7b]
- - [People's Daily corpus, 20 million]
-## Training notes
-At most 100,000 sentences per punctuation mark; 5 million training sentences in total; trained for 3 epochs.

+# bert4sl_punct_zh_public
+## Project URL
+[https://github.com/yongzhuo/macro-correct](https://github.com/yongzhuo/macro-correct)
+## Time
+2024.6
+## Training data composition (dataset)
+Built from collected high-quality corpora, filtered by perplexity (PPL) and other methods:
+ - [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
+ - [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
+ - [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
+ - [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
+ - [Xuexi Qiangguo, 4.28 million records](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w); mirror for mainland China: [Macropodus/xuexiqiangguo_428w](https://hf-mirror.com/datasets/Macropodus/xuexiqiangguo_428w)
+ - [xi_talk, 400,000 records](https://huggingface.co/datasets/Papersnake/xi_talk); mirror for mainland China: [Papersnake/xi_talk](https://hf-mirror.com/datasets/Papersnake/xi_talk)
+ - [1 million good sentences generated by qwen-7b]
+ - [People's Daily corpus, 20 million]
+## Training notes
+At most 100,000 sentences per punctuation mark; 5 million training sentences in total; trained for 3 epochs.
+## Usage: punctuation correction
+```python
+import os
+os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
+from macro_correct import correct_punct
+# 1. Default punctuation correction (list input)
+text_list = ["山不在高有仙则名。",
+             "水不在深，有龙则灵",
+             "斯是陋室惟吾德馨",
+             "苔痕上阶绿草,色入帘青。"
+             ]
+text_csc = correct_punct(text_list)
+print("Default punctuation correction (list input):")
+for res_i in text_csc:
+    print(res_i)
+print("#" * 128)
+"""
+Default punctuation correction (list input):
+{'index': 0, 'source': '山不在高有仙则名。', 'target': '山不在高，有仙则名。', 'score': 0.9917, 'errors': [['', '，', 4, 0.9917]]}
+{'index': 1, 'source': '水不在深，有龙则灵', 'target': '水不在深，有龙则灵。', 'score': 0.9995, 'errors': [['', '。', 9, 0.9995]]}
+{'index': 2, 'source': '斯是陋室惟吾德馨', 'target': '斯是陋室，惟吾德馨。', 'score': 0.9999, 'errors': [['', '，', 4, 0.9999], ['', '。', 8, 0.9998]]}
+{'index': 3, 'source': '苔痕上阶绿草,色入帘青。', 'target': '苔痕上阶绿，草色入帘青。', 'score': 0.9998, 'errors': [['', '，', 5, 0.9998]]}
+"""
+```
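
In the sample output above, the fourth result's `errors` list records only the inserted `，`, while its `target` also drops the ASCII `,` found in `source`. The following is a minimal sketch, assuming only the `source` and `target` fields shown in that output, of how every character-level change could be recovered with Python's standard `difflib`. The helper name `punct_edits` is hypothetical and is not part of macro_correct.

```python
from difflib import SequenceMatcher


def punct_edits(source, target):
    """List every non-matching span between source and target.

    Each item is (op, source_pos, removed_text, inserted_text), where op is
    one of difflib's opcode tags: "insert", "delete", or "replace".
    This is an illustrative helper, not a macro_correct API.
    """
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            edits.append((op, i1, source[i1:i2], target[j1:j2]))
    return edits


# Fields copied from the fourth result in the sample output.
result = {
    "source": "苔痕上阶绿草,色入帘青。",
    "target": "苔痕上阶绿，草色入帘青。",
}
edits = punct_edits(result["source"], result["target"])
for edit in edits:
    print(edit)
```

Comparing `source` against `target` directly avoids relying on the exact semantics of the `errors` field, which the README does not document.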