---
language: ko
license: apache-2.0
tags:
  - text2text-generation
  - korean
  - politeness
  - typo-correction
---

# Finetuned ET5 for Politeness and Typo Correction

This model is a fine-tuned version of j5ng/et5-typos-corrector for politeness enhancement and typo correction in Korean text. It transforms informal or typo-laden sentences into polite, grammatically correct ones.

## Dataset

- Source: custom dataset (last_dataset_v2.jsonl)
- Size: ~300 examples
- Task: convert informal or erroneous Korean sentences into polite, grammatically correct ones
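
The dataset is a JSONL file, one example per line. Its exact field names are not documented, so the `"input"`/`"output"` keys below are assumptions; this is a minimal, self-contained sketch of loading such a file with the standard library:

```python
import json

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Write a tiny stand-in file so the sketch runs on its own;
# the real last_dataset_v2.jsonl has ~300 such records.
sample = [
    {"input": "공손화: 왜 이거 또 틀렸어?", "output": "왜 이것을 또 틀리셨나요?"},
    {"input": "공손화: 숙제 언제까지임?", "output": "숙제는 언제까지 제출하면 될까요?"},
]
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for rec in sample:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

data = load_jsonl("sample.jsonl")
print(len(data))  # 2
```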

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("kimseongsan/finetuned-et5-politeness")
model = AutoModelForSeq2SeqLM.from_pretrained("kimseongsan/finetuned-et5-politeness")

# Note the "공손화:" ("politeness") task prefix before the raw sentence.
input_text = "공손화: 왜 이거 또 틀렸어요?좀"  # "Why did you get this wrong again? Come on"
inputs = tokenizer(input_text, return_tensors="pt", max_length=64, truncation=True)
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: "왜 이것을 또 틀리셨나요? 조금 더 주의해 주시면 좋겠습니다."
# ("Why did you get this wrong again? I'd appreciate it if you were a bit more careful.")
```
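
The example above prepends a "공손화: " task prefix to the raw sentence. Assuming the model always expects that prefix at inference time (inferred from the example, not separately documented), a tiny helper keeps callers from forgetting it:

```python
def add_politeness_prefix(sentence: str, prefix: str = "공손화: ") -> str:
    """Prepend the task prefix unless it is already present.

    The "공손화: " prefix is inferred from the usage example above;
    that the model requires it on every input is an assumption.
    """
    sentence = sentence.strip()
    return sentence if sentence.startswith(prefix.strip()) else prefix + sentence

print(add_politeness_prefix("왜 이거 또 틀렸어요?좀"))
# 공손화: 왜 이거 또 틀렸어요?좀
```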

## Training

- Base Model: j5ng/et5-typos-corrector
- Training Args:
  - Learning Rate: 2e-5
  - Epochs: 5
  - Batch Size: 8
  - Optimizer: AdamW
- Hardware: GPU (e.g., NVIDIA T4)
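
With ~300 examples, a batch size of 8, and 5 epochs, training amounts to roughly 190 optimizer steps. A quick sanity check (the dataset size is approximate, so the step count is too):

```python
import math

dataset_size = 300  # approximate, per the Dataset section
batch_size = 8
epochs = 5

steps_per_epoch = math.ceil(dataset_size / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 38 190
```

At this scale a single T4 finishes in minutes; the small step count is one reason overfitting (noted under Limitations) is a real concern.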

## Limitations

- The small dataset (~300 examples) may lead to overfitting.
- Training data is limited to educational contexts (e.g., "쌤" [informal "teacher"], "숙제" ["homework"]). Generalizing to other domains may require additional data.

## License

Apache 2.0