---
license: apache-2.0
base_model: google/mt5-large
tags:
- thai
- grammatical-error-correction
- mt5
- fine-tuned
- l2-learners
- generated_from_keras_callback
model-index:
- name: pakawadeep/ctfl-gec-th
  results:
  - task:
      name: Grammatical Error Correction
      type: text2text-generation
    dataset:
      name: CTFL-GEC
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: 0.47
    - name: Recall
      type: recall
      value: 0.47
    - name: F1
      type: f1
      value: 0.47
    - name: F0.5
      type: f0.5
      value: 0.47
    - name: BLEU
      type: bleu
      value: 0.69
    - name: GLEU
      type: gleu
      value: 0.68
    - name: CHRF
      type: chrf
      value: 0.87
language:
- th
---
# pakawadeep/ctfl-gec-th

This model is a fine-tuned version of [google/mt5-large](https://huggingface.co/google/mt5-large), trained for Grammatical Error Correction (GEC) in Thai for L2 learners. It was developed as part of the thesis *Grammatical Error Correction for L2 Learners of Thai Using Large Language Models* and is the best-performing model in that study.
## Model description
This model is based on the mT5-large architecture and was fine-tuned on the CTFL-GEC dataset, which contains human-annotated grammatical error corrections of L2 Thai learner writing. To improve generalization, the training data was augmented with Self-Instruct-generated synthetic pairs amounting to 200% of the original corpus.
The model is capable of correcting sentence-level grammatical errors typical of L2 Thai writing, including issues with word order, omissions, and incorrect particles.
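A minimal usage sketch is shown below. It assumes the checkpoint ships TensorFlow weights (this card lists Keras/TensorFlow framework versions); any task prefix used during fine-tuning is not documented, so the input is passed as-is, and the generation settings are illustrative:

```python
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("pakawadeep/ctfl-gec-th")
model = TFAutoModelForSeq2SeqLM.from_pretrained("pakawadeep/ctfl-gec-th")

# Illustrative input sentence; replace with your own learner text.
text = "ฉันชอบกินอาหารไทยมาก"
inputs = tokenizer(text, return_tensors="tf")

# Beam-search settings are illustrative, not taken from the thesis.
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```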
## Intended uses & limitations

### Intended uses
- Grammatical error correction for Thai language learners
- Linguistic analysis of L2 learner errors
- Research in low-resource GEC methods
### Limitations
- May not generalize to informal or dialectal Thai
- Performance may degrade on sentence types or domains not represented in the training data
- Designed for Thai GEC only; not optimized for multilingual correction tasks
## Training and evaluation data
The model was fine-tuned on a combined dataset consisting of:
- CTFL-GEC: A manually annotated corpus of Thai learner writing (370 writing samples, 4,200+ sentences)
- Self-Instruct augmentation (200%): Synthetic GEC pairs generated using LLM prompting (an illustrative prompt sketch follows this list)
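The actual prompts used for augmentation are not reproduced here; the following is a purely illustrative sketch of a Self-Instruct-style prompt for producing synthetic error/correction pairs:

```python
# Hypothetical prompt template for Self-Instruct-style data generation.
# The instruction wording, output format, and generator model are all
# assumptions for illustration, not the thesis's actual setup.
PROMPT_TEMPLATE = """You are given a grammatically correct Thai sentence.
Rewrite it to contain one error typical of L2 learners (word order,
omission, or an incorrect particle), then return a JSON object:
{{"incorrect": "<sentence with error>", "correct": "<original sentence>"}}

Sentence: {sentence}"""

def build_prompt(sentence: str) -> str:
    """Fill the template with a seed sentence drawn from correct Thai text."""
    return PROMPT_TEMPLATE.format(sentence=sentence)

print(build_prompt("ฉันไปตลาดกับแม่เมื่อวานนี้"))
```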
Evaluation was conducted on a held-out portion of the human-annotated dataset using standard GEC metrics (precision, recall, F1, F0.5, BLEU, GLEU, and chrF).
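As a hedged sketch, the corpus-level scores could be reproduced along these lines, using sacrebleu for BLEU/chrF, NLTK for GLEU, and PyThaiNLP for word segmentation. The thesis's exact scorer and tokenization are not specified here, and the edit-based precision/recall/F0.5 scores would come from an M2- or ERRANT-style scorer, which is omitted:

```python
import sacrebleu
from nltk.translate.gleu_score import corpus_gleu
from pythainlp.tokenize import word_tokenize

# Parallel lists of system outputs and gold corrections (toy examples).
hypotheses = ["ฉันไปโรงเรียนเมื่อวานนี้"]
references = ["ฉันไปโรงเรียนเมื่อวานนี้"]

# BLEU and chrF operate directly on raw strings.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

# GLEU expects tokenized text; Thai is written without spaces,
# so segment each sentence into words first.
gleu = corpus_gleu(
    [[word_tokenize(r)] for r in references],
    [word_tokenize(h) for h in hypotheses],
)

print(f"BLEU={bleu.score:.2f} chrF={chrf.score:.2f} GLEU={gleu:.2f}")
```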
## Training procedure

### Training hyperparameters
- Optimizer: AdamWeightDecay
- Learning rate: 2e-5
- Beta1/Beta2: 0.9 / 0.999
- Epsilon: 1e-7
- Weight decay: 0.01
- Training precision: float32
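A minimal sketch of this optimizer configuration, using the AdamWeightDecay class from Transformers' TensorFlow utilities (the weight-decay exclusion list is a common convention and an assumption here, not confirmed for the original run):

```python
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(
    learning_rate=2e-5,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    weight_decay_rate=0.01,
    # Excluding bias and LayerNorm parameters from decay is conventional;
    # whether the original run did so is an assumption.
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"],
)
# Pass to model.compile(optimizer=optimizer) on a TF mT5 model.
```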
### Framework versions
- Transformers 4.41.2
- TensorFlow 2.15.0
- Datasets 2.20.0
- Tokenizers 0.19.1
## Citation
If you use this model, please cite the associated thesis:
Pakawadee P. Chookwan, *Grammatical Error Correction for L2 Learners of Thai Using Large Language Models*, 2025.