---
library_name: transformers
tags:
- slang
- korea
- profanity
- translator
- korean
license: mit
language:
- ko
base_model:
- hyunwoongko/kobart
pipeline_tag: translation
---

# KoBART-based Korean Slang Translator

This model is a translator that converts Korean slang into standard Korean, built by fine-tuning KoBART.

## Base Model

[SKT - KoBART](https://github.com/SKT-AI/KoBART)

## Dataset

We used the [Age-specific Characteristic Utterances (Jargon, Slang, etc.) dataset](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71320) from AI Hub. Training was carried out on trendy-expression and slang data from speakers in their teens, twenties, and thirties.
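The preprocessing script is not included in this card. As a rough, hypothetical sketch (the variable names and helper below are assumptions, not the released pipeline), slang/standard sentence pairs from the corpus can be tokenized into encoder inputs and decoder labels for KoBART fine-tuning:

```python
from transformers import PreTrainedTokenizerFast

# Hypothetical preprocessing sketch -- not the released training script.
# Assumes the AI Hub corpus has been exported to parallel lists of slang
# utterances and their standard-language rewrites.
tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")

slang_sentences = ["아 롤하는데 한타에서 졌어"]
standard_sentences = ["아 리그 오브 레전드하는데 대규모 교전에서 졌어"]

def build_features(sources, targets, max_length=64):
    # Encoder inputs come from the slang sentences...
    model_inputs = tokenizer(sources, max_length=max_length, truncation=True, padding="max_length")
    # ...and the standard-language rewrites become the decoder labels.
    labels = tokenizer(targets, max_length=max_length, truncation=True, padding="max_length")
    # In practice, pad positions in the labels are usually replaced with -100
    # so they are ignored by the loss.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

features = build_features(slang_sentences, standard_sentences)
print(sorted(features.keys()))
```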
## Usage Example

```text
Input Text: 아 롤하는데 한타에서 졌어
Generated Text: 아 리그 오브 레전드하는데 대규모 교전에서 졌어
```

## Training Details

### Hyperparameters

```python
training_args = TrainingArguments(
    output_dir="your dir",
    evaluation_strategy="steps",
    eval_steps=10000,
    save_strategy="steps",
    save_steps=10000,
    learning_rate=2e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=8,
    logging_dir="your dir",
    num_train_epochs=5,
    weight_decay=0.01,
    fp16=True,
    report_to="none",
    logging_steps=1000,
    warmup_steps=500,
    lr_scheduler_type="linear",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
```

### Training Environment

- GPU: NVIDIA RTX A5000
- Training Time: 8 hours

### Training Results

| Step    | Training Loss | Validation Loss |
|---------|---------------|-----------------|
| 100000  | 0.0591000     | 0.047132        |
| 200000  | 0.0303000     | 0.024423        |
| 300000  | 0.0208000     | 0.017365        |
| 400000  | 0.0159000     | 0.013130        |
| 500000  | 0.0129000     | 0.011025        |
| 5900000 | 0.0002000     | 0.007907        |
| 6000000 | 0.0002000     | 0.007920        |
| 6100000 | 0.0002000     | 0.007869        |

## How to Use

```python
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

MODEL_NAME = "hongggggggggggg/korea-slang-translator-kobert"

tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

# Test input data
input_text = "아 롤하는데 한타에서 졌어"

# Tokenize the input text
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Model inference
output_ids = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)

# Decode the generated text (take the first sequence in the batch)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Print results
print("Input Text:", input_text)
print("Generated Text:", output_text)
```
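For convenience, the snippet above can be wrapped into a small helper. The `translate` function below is an illustrative addition (not part of the released code) that runs batched beam search and uses a GPU when one is available:

```python
import torch
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

MODEL_NAME = "hongggggggggggg/korea-slang-translator-kobert"

tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def translate(sentences, max_length=50, num_beams=4):
    # Batch-encode the slang sentences with padding so they can be generated together.
    encoded = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    output_ids = model.generate(
        input_ids=encoded["input_ids"].to(device),
        attention_mask=encoded["attention_mask"].to(device),
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True,
    )
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]

print(translate(["아 롤하는데 한타에서 졌어"]))
```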