---
language: ko
license: apache-2.0
tags:
- text2text-generation
- korean
- politeness
- typo-correction
---
# Finetuned ET5 for Politeness and Typo Correction
This model is a fine-tuned version of [j5ng/et5-typos-corrector](https://huggingface.co/j5ng/et5-typos-corrector) for politeness enhancement and typo correction in Korean text. It transforms informal or typo-laden sentences into polite, grammatically correct ones.
## Dataset
- Source: Custom dataset (`last_dataset_v2.jsonl`); a sketch of the record format follows this list.
- Size: ~300 examples
- Task: Converts informal/erroneous Korean sentences into polite, correct ones.
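Each JSONL record presumably pairs an informal source sentence with its polite target. Below is a minimal sketch of inspecting one record; the field names `input` and `output` are assumptions, not confirmed by this card.

```python
import json

# Peek at the first record; the "input"/"output" field names are assumed.
with open("last_dataset_v2.jsonl", encoding="utf-8") as f:
    example = json.loads(f.readline())
print(example)
# e.g. {"input": "공손화: 왜 이거 또 틀렸어요?좀",
#       "output": "이것을 또 틀리셨나요? 조금 더 주의해 주시면 좋겠습니다."}
```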
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("kimseongsan/finetuned-et5-politeness")
model = AutoModelForSeq2SeqLM.from_pretrained("kimseongsan/finetuned-et5-politeness")

# Informal, typo-laden input with the "공손화:" (politeness) task prefix,
# roughly: "Why did you get this wrong again? Come on."
input_text = "공손화: 왜 이거 또 틀렸어요?좀"
inputs = tokenizer(input_text, return_tensors="pt", max_length=64, truncation=True)
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: "이것을 또 틀리셨나요? 조금 더 주의해 주시면 좋겠습니다."
# ("Did you get this wrong again? I'd appreciate a bit more care.")
```
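For quick experiments, the model can also be called through the `text2text-generation` pipeline. This is a minimal sketch that assumes the same "공손화: " task prefix shown above.

```python
from transformers import pipeline

# One-liner inference via the text2text-generation pipeline.
corrector = pipeline("text2text-generation", model="kimseongsan/finetuned-et5-politeness")
result = corrector("공손화: 왜 이거 또 틀렸어요?좀", max_length=64)
print(result[0]["generated_text"])
```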
## Training
- Base Model: `j5ng/et5-typos-corrector`
- Training Args (see the sketch after this list):
  - Learning Rate: 2e-5
  - Epochs: 5
  - Batch Size: 8
  - Optimizer: AdamW
- Hardware: GPU (e.g., NVIDIA T4)
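As a reproducibility aid, the sketch below maps these arguments onto the `transformers` `Seq2SeqTrainer` API. Only the hyperparameters above come from this card; the JSONL field names, the `max_length` of 64, and the preprocessing details are assumptions.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("j5ng/et5-typos-corrector")
model = AutoModelForSeq2SeqLM.from_pretrained("j5ng/et5-typos-corrector")

# Tokenize the (assumed) "input"/"output" pairs from the JSONL file.
raw = Dataset.from_json("last_dataset_v2.jsonl")

def preprocess(batch):
    model_inputs = tokenizer(batch["input"], max_length=64, truncation=True)
    labels = tokenizer(text_target=batch["output"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="finetuned-et5-politeness",
    learning_rate=2e-5,
    num_train_epochs=5,
    per_device_train_batch_size=8,
    optim="adamw_torch",  # AdamW
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```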
## Limitations
- The small dataset size (~300 examples) may lead to overfitting.
- The training data is limited to an educational context (e.g., "쌤" for "teacher", "숙제" for "homework"); generalization to other domains may require additional data.
## License
Apache 2.0