## Korean Grammatical Error Correction Model
Maintainer: [Soyoung Yoon](https://soyoung97.github.io/profile/)

Official repository: [link](https://github.com/soyoung97/GEC-Korean)

Dataset request form: [link](https://forms.gle/kF9pvJbLGvnh8ZnQ6)

Demo: [link](https://huggingface.co/spaces/Soyoung97/gec-korean-demo)

Colab demo: [link](https://colab.research.google.com/drive/1CL__3CpkhBzxWUbvsQmPTQWWu1cWmJHa?usp=sharing)


### Sample code
```python
import torch
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

tokenizer = PreTrainedTokenizerFast.from_pretrained('Soyoung97/gec_kr')
model = BartForConditionalGeneration.from_pretrained('Soyoung97/gec_kr')

text = '한국어는어렵다.'  # "Korean is difficult." with a missing space

# Wrap the raw token ids with BOS/EOS before passing them to the model.
raw_input_ids = tokenizer.encode(text)
input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

# Beam-search decoding with a repetition penalty to correct the sentence.
corrected_ids = model.generate(torch.tensor([input_ids]),
                               max_length=128,
                               eos_token_id=1, num_beams=4,
                               early_stopping=True, repetition_penalty=2.0)
output_text = tokenizer.decode(corrected_ids.squeeze().tolist(), skip_special_tokens=True)

output_text
>>> '한국어는 어렵다.'
```
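
The manual BOS/EOS wrapping step above can be factored into a small helper when correcting many sentences. This is an illustrative sketch, not part of the repository; the function name and the placeholder token ids are assumptions (in practice, pass `tokenizer.bos_token_id` and `tokenizer.eos_token_id`).

```python
# Hypothetical helper (not part of the repository): mirrors the manual
# BOS/EOS wrapping done in the sample code above.
def wrap_with_special_tokens(raw_ids, bos_token_id, eos_token_id):
    """Return raw token ids framed by the BOS and EOS special tokens."""
    return [bos_token_id] + list(raw_ids) + [eos_token_id]

# Placeholder ids for illustration only; use the tokenizer's real
# bos_token_id / eos_token_id attributes in practice.
BOS, EOS = 0, 1
ids = wrap_with_special_tokens([5000, 6000, 7000], BOS, EOS)
print(ids)  # [0, 5000, 6000, 7000, 1]
```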

Special thanks to the [KoBART-summarization repository](https://huggingface.co/gogamza/kobart-summarization), which this model card's sample code is based on.