---
license: apache-2.0
base_model: google/mt5-large
tags:
- thai
- grammatical-error-correction
- mt5
- fine-tuned
- l2-learners
- generated_from_keras_callback
model-index:
- name: pakawadeep/ctfl-gec-th
  results:
  - task:
      name: Grammatical Error Correction
      type: text2text-generation
    dataset:
      name: CTFL-GEC
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: 0.47
    - name: Recall
      type: recall
      value: 0.47
    - name: F1
      type: f1
      value: 0.47
    - name: F0.5
      type: f0.5
      value: 0.47
    - name: BLEU
      type: bleu
      value: 0.69
    - name: GLEU
      type: gleu
      value: 0.68
    - name: CHRF
      type: chrf
      value: 0.87
language:
- th
---

# pakawadeep/ctfl-gec-th

This model is a fine-tuned version of [google/mt5-large](https://huggingface.co/google/mt5-large), trained for **Grammatical Error Correction (GEC)** in **Thai** for **L2 learners**. It was developed as part of the research *"Grammatical Error Correction for L2 Learners of Thai Using Large Language Models"* and represents the best-performing model in the study.

## Model description

This model is based on the mT5-large architecture and was fine-tuned on the CTFL-GEC dataset, which contains human-annotated grammatical error corrections from L2 Thai learners. To improve generalization, the dataset was augmented using the Self-Instruct method with 200% additional synthetic pairs.

The model corrects sentence-level grammatical errors typical of L2 Thai writing, including issues with word order, omissions, and incorrect particles.

## Intended uses & limitations

### Intended uses

- Grammatical error correction for Thai language learners
- Linguistic analysis of L2 learner errors
- Research in low-resource GEC methods

### Limitations

- May not generalize to informal or dialectal Thai
- Performance may degrade on sentence types or domains not represented in the training data
- Designed for Thai GEC only; not optimized for multilingual correction tasks

## Training and evaluation data

The model was fine-tuned on a combined dataset consisting of:

- **CTFL-GEC**: A manually annotated corpus of Thai learner writing (370 writing samples, 4,200+ sentences)
- **Self-Instruct augmentation (200%)**: Synthetic GEC pairs generated using LLM prompting

Evaluation was conducted on a held-out portion of the human-annotated dataset using common GEC metrics.

## Training procedure

### Training hyperparameters

- **Optimizer**: AdamWeightDecay
- **Learning rate**: 2e-5
- **Beta1/Beta2**: 0.9 / 0.999
- **Epsilon**: 1e-7
- **Weight decay**: 0.01
- **Precision**: float32

### Framework versions

- Transformers 4.41.2
- TensorFlow 2.15.0
- Datasets 2.20.0
- Tokenizers 0.19.1

## Citation

If you use this model, please cite the associated thesis:

```
Pakawadee P. Chookwan, "Grammatical Error Correction for L2 Learners of Thai Using Large Language Models", 2025.
```
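
## How to use

The snippet below is a minimal inference sketch, assuming the standard `transformers` sequence-to-sequence API with TensorFlow weights (consistent with the framework versions above). The example sentence, the generation settings, and the absence of a task prefix are illustrative assumptions, not details taken from the thesis:

```python
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint (TensorFlow, matching the framework versions above).
tokenizer = AutoTokenizer.from_pretrained("pakawadeep/ctfl-gec-th")
model = TFAutoModelForSeq2SeqLM.from_pretrained("pakawadeep/ctfl-gec-th")

# Placeholder learner sentence; if training used a task prefix, prepend it here.
text = "ฉันไปโรงเรียนเมื่อวานนี้"

inputs = tokenizer(text, return_tensors="tf")
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Beam search (`num_beams=4`) is a common default for GEC decoding; greedy decoding is faster if latency matters.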
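
## Optimizer sketch

For reference, the optimizer listed under "Training procedure" can be instantiated with the `AdamWeightDecay` class that ships with `transformers`. Only the values shown in that section are taken from this card; the learning-rate schedule, batch size, and epoch count are not documented here and are omitted:

```python
from transformers import AdamWeightDecay

# Hyperparameters as listed under "Training procedure"; a constant learning
# rate is assumed since this card does not specify a schedule.
optimizer = AdamWeightDecay(
    learning_rate=2e-5,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    weight_decay_rate=0.01,
)
```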
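
## Evaluation sketch

A rough sketch of how the surface metrics above (BLEU, chrF) can be computed with `sacrebleu`; this is not the evaluation script used in the study, and the hypothesis/reference pairs are placeholders. chrF operates on characters and needs no Thai word segmentation, while BLEU on unsegmented Thai requires an explicit tokenization choice (character-level is used here as one reasonable option). The precision/recall/F0.5 figures would instead come from an edit-based scorer such as ERRANT or the M² scorer, which is not shown:

```python
import sacrebleu

# Hypothetical sentence-aligned data: one corrected hypothesis per source
# sentence, and one reference stream covering all sentences.
hypotheses = ["ฉันไปโรงเรียนเมื่อวานนี้"]
references = [["ฉันไปโรงเรียนเมื่อวานนี้"]]

# chrF is character-based, so it handles unsegmented Thai directly.
chrf = sacrebleu.corpus_chrf(hypotheses, references)

# BLEU with character tokenization sidesteps Thai word-boundary ambiguity.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="char")

print(f"chrF: {chrf.score:.2f}  BLEU: {bleu.score:.2f}")
```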