Sri Santh committed
Commit 9ac9ea4 · verified · 1 Parent(s): cad635f

Upload 6 files
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ eval_loss_1.5b.png filter=lfs diff=lfs merge=lfs -text
+ train_loss_1.5b.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,162 @@
+ ---
+ library_name: peft
+ base_model: Qwen/Qwen2.5-1.5B-Instruct
+ license: apache-2.0
+ datasets:
+ - shibing624/chinese_text_correction
+ language:
+ - zh
+ metrics:
+ - f1
+ tags:
+ - text-generation-inference
+ widget:
+ - text: "文本纠错:\n少先队员因该为老人让坐。"
+ ---
+
+ # Chinese Text Correction Model
+ Chinese text correction model chinese-text-correction-1.5b-lora, for spelling correction and grammar correction.
+
+ Evaluation of `shibing624/chinese-text-correction-1.5b-lora` on test data:
+
+ The overall performance on the CSC **test** set:
+
+ |input_text|predict_text|
+ |:--- |:--- |
+ |文本纠错:\n少先队员因该为老人让坐。|少先队员应该为老人让座。|
+
+ # Models
+
+ | Name | Base Model | Download |
+ |-----------------|-------------------|-----------------------------------------------------------------------|
+ | chinese-text-correction-1.5b | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b) |
+ | chinese-text-correction-1.5b-lora | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora) |
+ | chinese-text-correction-7b | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b) |
+ | chinese-text-correction-7b-lora | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b-lora) |
+
+ ### Evaluation Results
+ - Evaluation metric: F1
+ - CSC (Chinese Spelling Correction): spelling-correction models; they handle length-preserving corrections such as phonetically similar, visually similar, and grammatical errors
+ - CTC (Chinese Text Correction): text-correction models; beyond the length-preserving spelling and grammar corrections, they also handle length-changing errors such as extra or missing characters
+ - GPU: Tesla V100, 32 GB VRAM
+
+ | Model Name | Model Link | Base Model | Avg | SIGHAN-2015 | EC-LAW | MCSC | GPU/CPU | QPS |
+ |:-----------------|:------------------------------------------------------------------------------------------------------------------------|:---------------------------|:-----------|:------------|:-------|:-------|:--------|:--------|
+ | Kenlm-CSC | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | kenlm | 0.3409 | 0.3147 | 0.3763 | 0.3317 | CPU | 9 |
+ | Mengzi-T5-CSC | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction) | mengzi-t5-base | 0.3984 | 0.7758 | 0.3156 | 0.1039 | GPU | 214 |
+ | ERNIE-CSC | [PaddleNLP/ernie-csc](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/text_correction/ernie-csc) | PaddlePaddle/ernie-1.0-base-zh | 0.4353 | 0.8383 | 0.3357 | 0.1318 | GPU | 114 |
+ | MacBERT-CSC | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese) | hfl/chinese-macbert-base | 0.3993 | 0.8314 | 0.1610 | 0.2055 | GPU | **224** |
+ | ChatGLM3-6B-CSC | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora) | THUDM/chatglm3-6b | 0.4538 | 0.6572 | 0.4369 | 0.2672 | GPU | 3 |
+ | Qwen2.5-1.5B-CTC | [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b) | Qwen/Qwen2.5-1.5B-Instruct | 0.6802 | 0.3032 | 0.7846 | 0.9529 | GPU | 6 |
+ | Qwen2.5-7B-CTC | [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) | Qwen/Qwen2.5-7B-Instruct | **0.8225** | 0.4917 | 0.9798 | 0.9959 | GPU | 3 |
+
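+ For context, a minimal sketch of one common sentence-level correction F1 (illustrative only; the exact protocol behind the table above is implemented in pycorrector's evaluation scripts):
+
+ ```python
+ # Sentence-level correction F1: a prediction is a true positive only if the
+ # model changed the input and the result exactly matches the reference.
+ def correction_f1(sources, predictions, references):
+     tp = fp = fn = 0
+     for src, pred, ref in zip(sources, predictions, references):
+         if pred != src:      # the model proposed a correction
+             if pred == ref:
+                 tp += 1      # correct edit
+             else:
+                 fp += 1      # wrong or spurious edit
+         elif ref != src:
+             fn += 1          # missed an error
+     p = tp / (tp + fp) if tp + fp else 0.0
+     r = tp / (tp + fn) if tp + fn else 0.0
+     return 2 * p * r / (p + r) if p + r else 0.0
+ ```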
+
+ ## Usage (pycorrector)
+
+ This model is open-sourced in the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports using fine-tuned LLMs for text correction. Call it as follows:
+
+ Install package:
+ ```shell
+ pip install -U pycorrector
+ ```
+
+ ```python
+ from pycorrector.gpt.gpt_corrector import GptCorrector
+
+ if __name__ == '__main__':
+     # Example sentences containing typical Chinese typos
+     error_sentences = [
+         '真麻烦你了。希望你们好好的跳无',
+         '少先队员因该为老人让坐',
+         '机七学习是人工智能领遇最能体现智能的一个分知',
+         '一只小鱼船浮在平净的河面上',
+         '我的家乡是有明的渔米之乡',
+     ]
+     m = GptCorrector("shibing624/chinese-text-correction-1.5b")
+
+     batch_res = m.correct_batch(error_sentences)
+     for i in batch_res:
+         print(i)
+         print()
+ ```
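+
+ Note: this repository hosts the LoRA adapter rather than a merged model. Based on pycorrector's examples for its other LoRA checkpoints, the adapter can reportedly be attached to the base model via `peft_name` (an assumption; verify against the pycorrector version you install):
+
+ ```python
+ from pycorrector.gpt.gpt_corrector import GptCorrector
+
+ # Load the base model and attach this repo's LoRA adapter
+ # (peft_name per pycorrector's LoRA examples; treat as an assumption)
+ m = GptCorrector("Qwen/Qwen2.5-1.5B-Instruct",
+                  peft_name="shibing624/chinese-text-correction-1.5b-lora")
+ print(m.correct_batch(['少先队员因该为老人让坐']))
+ ```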
+
+ ## Usage (HuggingFace Transformers)
+ Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model like this:
+
+ First, pass your input through the model; then read off the generated sentence.
+
+ Install package:
+ ```
+ pip install transformers
+ ```
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ checkpoint = "shibing624/chinese-text-correction-1.5b"
+
+ device = "cuda"  # for GPU usage, or "cpu" for CPU usage
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
+
+ input_content = "文本纠错:\n少先队员因该为老人让坐。"
+
+ messages = [{"role": "user", "content": input_content}]
+ input_text = tokenizer.apply_chat_template(messages, tokenize=False)
+
+ print(input_text)
+
+ inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
+ # Greedy decoding: temperature is ignored when do_sample=False
+ outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)
+
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ Output:
+ ```shell
+ 少先队员应该为老人让座。
+ ```
+
+ Model files:
+ ```
+ shibing624/chinese-text-correction-1.5b-lora
+ ├── adapter_config.json
+ └── adapter_model.safetensors
+ ```
+
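+ To use these adapter files directly with 🤗 PEFT, load the base model and attach the adapter (a minimal sketch using the standard `peft` API):
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+
+ base = "Qwen/Qwen2.5-1.5B-Instruct"
+ tokenizer = AutoTokenizer.from_pretrained(base)
+ model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
+
+ # Attach the LoRA weights from this repository on top of the base model
+ model = PeftModel.from_pretrained(model, "shibing624/chinese-text-correction-1.5b-lora")
+ model = model.merge_and_unload()  # optional: fold the adapter into the base weights
+ ```
+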
+ #### Training parameters:
+
+ - num_epochs: 8
+ - batch_size: 4
+ - steps: 36000
+ - eval_loss: 0.14
+ - base model: Qwen/Qwen2.5-1.5B-Instruct
+ - train data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
+ - train time: 9 days 8 hours
+ - eval_loss curve: ![](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora/resolve/main/eval_loss_1.5b.png)
+ - train_loss curve: ![](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora/resolve/main/train_loss_1.5b.png)
+
+ ### Training Dataset
+ #### Chinese Text Correction Dataset
+
+ - Data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
+
+ To train your own Qwen-based correction model, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector) or [https://github.com/shibing624/MedicalGPT](https://github.com/shibing624/MedicalGPT)
+
+ ### Framework versions
+
+ - PEFT 0.11.1
+
+ ## Citation
+
+ ```latex
+ @software{pycorrector,
+   author = {Xu Ming},
+   title = {pycorrector: Implementation of language model finetune},
+   year = {2024},
+   url = {https://github.com/shibing624/pycorrector},
+ }
+ ```
+
adapter_config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "Qwen/Qwen2.5-1.5B-Instruct",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 16.0,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 8,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "v_proj",
+     "up_proj",
+     "down_proj",
+     "q_proj",
+     "gate_proj",
+     "k_proj",
+     "o_proj"
+   ],
+   "task_type": "CAUSAL_LM",
+   "use_dora": false,
+   "use_rslora": false,
+   "semantic_routing": {
+     "questions": [
+       "Is this query asking to fix '的地得' usage in Chinese?",
+       "Does this involve correcting common Chinese grammar patterns like '把' or '被'?",
+       "Is this about fixing punctuation marks in Chinese text?",
+       "Does this involve correcting simplified/traditional character usage or typos?"
+     ]
+   }
+ }
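The config uses the standard PEFT LoRA schema: rank r=8, lora_alpha=16, dropout 0.05, with LoRA applied to every attention and MLP projection. Since use_rslora is false, the adapter's effective scaling is lora_alpha / r. A small sketch inspecting a downloaded copy of the config:

```python
import json

# Inspect the LoRA hyperparameters from a local copy of adapter_config.json
with open("adapter_config.json") as f:
    cfg = json.load(f)

print("rank:", cfg["r"], "alpha:", cfg["lora_alpha"])
print("target modules:", cfg["target_modules"])
print("LoRA scaling:", cfg["lora_alpha"] / cfg["r"])  # 16.0 / 8 = 2.0
```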
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:05c1993e98764b6ed0590f4fbc3deca84640bbab4ec655686e05e0628dd10971
+ size 36981072
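
This is a Git LFS pointer: the 36,981,072-byte weights live in LFS storage and are fetched on checkout. A quick sketch to check a downloaded file against the pointer's sha256 oid:

```python
import hashlib

# Hash the downloaded adapter weights and compare with the LFS pointer's oid
h = hashlib.sha256()
with open("adapter_model.safetensors", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert h.hexdigest() == "05c1993e98764b6ed0590f4fbc3deca84640bbab4ec655686e05e0628dd10971"
```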
eval_loss_1.5b.png ADDED

Git LFS Details

  • SHA256: 0777607093bdea62b5dbabd9c333f249bd210bdc56e49f4c72f03396c0defa6a
  • Pointer size: 131 Bytes
  • Size of remote file: 141 kB
train_loss_1.5b.png ADDED

Git LFS Details

  • SHA256: be9a2bada1c0bb537293760f33eef2760f12f8fab03caea68b8168f6408ab818
  • Pointer size: 131 Bytes
  • Size of remote file: 217 kB
trainer_state.json ADDED
The diff for this file is too large to render. See raw diff