---
license: apache-2.0
base_model: google/mt5-large
tags:
- thai
- grammatical-error-correction
- mt5
- fine-tuned
- l2-learners
- generated_from_keras_callback
model-index:
- name: pakawadeep/ctfl-gec-th
  results:
  - task:
      name: Grammatical Error Correction
      type: text2text-generation
    dataset:
      name: CTFL-GEC
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: 0.47
    - name: Recall
      type: recall
      value: 0.47
    - name: F1
      type: f1
      value: 0.47
    - name: F0.5
      type: f0.5
      value: 0.47
    - name: BLEU
      type: bleu
      value: 0.69
    - name: GLEU
      type: gleu
      value: 0.68
    - name: CHRF
      type: chrf
      value: 0.87
language:
- th
---
# pakawadeep/ctfl-gec-th

This model is a fine-tuned version of [google/mt5-large](https://huggingface.co/google/mt5-large), trained for Grammatical Error Correction (GEC) in Thai for L2 learners. It was developed as part of the thesis *Grammatical Error Correction for L2 Learners of Thai Using Large Language Models* and is the best-performing model in that study.
## Model description
This model is based on the mT5-large architecture and was fine-tuned on the CTFL-GEC dataset, which contains human-annotated grammatical error corrections of L2 Thai learner writing. To improve generalization, the training data was augmented with Self-Instruct-generated synthetic pairs amounting to 200% of the original corpus.
The model is capable of correcting sentence-level grammatical errors typical of L2 Thai writing, including issues with word order, omissions, and incorrect particles.
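A minimal usage sketch is shown below. It assumes the checkpoint ships TensorFlow weights (this card lists Keras/TensorFlow framework versions); any task prefix used during fine-tuning is not documented, so the input is passed as-is, and the generation settings are illustrative:

```python
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("pakawadeep/ctfl-gec-th")
model = TFAutoModelForSeq2SeqLM.from_pretrained("pakawadeep/ctfl-gec-th")

# Illustrative input sentence; replace with your own learner text.
text = "ฉันชอบกินอาหารไทยมาก"
inputs = tokenizer(text, return_tensors="tf")

# Beam-search settings are illustrative, not taken from the thesis.
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```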
## Intended uses & limitations

### Intended uses
- Grammatical error correction for Thai language learners
- Linguistic analysis of L2 learner errors
- Research in low-resource GEC methods
### Limitations
- May not generalize to informal or dialectal Thai
- Performance may degrade on sentence types or domains not represented in the training data
- Designed for Thai GEC only; not optimized for multilingual correction tasks
## Training and evaluation data
The model was fine-tuned on a combined dataset consisting of:
- CTFL-GEC: A manually annotated corpus of Thai learner writing (370 writing samples, 4,200+ sentences)
- Self-Instruct augmentation (200%): Synthetic GEC pairs generated using LLM prompting (an illustrative prompt sketch follows this list)
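The actual prompts used for augmentation are not reproduced here; the following is a purely illustrative sketch of a Self-Instruct-style prompt for producing synthetic error/correction pairs:

```python
# Hypothetical prompt template for Self-Instruct-style data generation.
# The instruction wording, output format, and generator model are all
# assumptions for illustration, not the thesis's actual setup.
PROMPT_TEMPLATE = """You are given a grammatically correct Thai sentence.
Rewrite it to contain one error typical of L2 learners (word order,
omission, or an incorrect particle), then return a JSON object:
{{"incorrect": "<sentence with error>", "correct": "<original sentence>"}}

Sentence: {sentence}"""

def build_prompt(sentence: str) -> str:
    """Fill the template with a seed sentence drawn from correct Thai text."""
    return PROMPT_TEMPLATE.format(sentence=sentence)

print(build_prompt("ฉันไปตลาดกับแม่เมื่อวานนี้"))
```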
Evaluation was conducted on a held-out portion of the human-annotated dataset using standard GEC metrics (precision, recall, F1, F0.5, BLEU, GLEU, and chrF).
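As a hedged sketch, the corpus-level scores could be reproduced along these lines, using sacrebleu for BLEU/chrF, NLTK for GLEU, and PyThaiNLP for word segmentation. The thesis's exact scorer and tokenization are not specified here, and the edit-based precision/recall/F0.5 scores would come from an M2- or ERRANT-style scorer, which is omitted:

```python
import sacrebleu
from nltk.translate.gleu_score import corpus_gleu
from pythainlp.tokenize import word_tokenize

# Parallel lists of system outputs and gold corrections (toy examples).
hypotheses = ["ฉันไปโรงเรียนเมื่อวานนี้"]
references = ["ฉันไปโรงเรียนเมื่อวานนี้"]

# BLEU and chrF operate directly on raw strings.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

# GLEU expects tokenized text; Thai is written without spaces,
# so segment each sentence into words first.
gleu = corpus_gleu(
    [[word_tokenize(r)] for r in references],
    [word_tokenize(h) for h in hypotheses],
)

print(f"BLEU={bleu.score:.2f} chrF={chrf.score:.2f} GLEU={gleu:.2f}")
```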
## Training procedure

### Training hyperparameters
- Optimizer: AdamWeightDecay
- Learning rate: 2e-5
- Beta1/Beta2: 0.9 / 0.999
- Epsilon: 1e-7
- Weight decay: 0.01
- Training precision: float32
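A minimal sketch of this optimizer configuration, using the AdamWeightDecay class from Transformers' TensorFlow utilities (the weight-decay exclusion list is a common convention and an assumption here, not confirmed for the original run):

```python
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(
    learning_rate=2e-5,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    weight_decay_rate=0.01,
    # Excluding bias and LayerNorm parameters from decay is conventional;
    # whether the original run did so is an assumption.
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"],
)
# Pass to model.compile(optimizer=optimizer) on a TF mT5 model.
```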
### Framework versions
- Transformers 4.41.2
- TensorFlow 2.15.0
- Datasets 2.20.0
- Tokenizers 0.19.1
## Citation
If you use this model, please cite the associated thesis:
Pakawadee P. Chookwan, *Grammatical Error Correction for L2 Learners of Thai Using Large Language Models*, 2025.