---
license: apache-2.0
base_model: google/mt5-large
tags:
- thai
- grammatical-error-correction
- mt5
- fine-tuned
- l2-learners
- generated_from_keras_callback
model-index:
- name: pakawadeep/ctfl-gec-th
  results:
  - task:
      name: Grammatical Error Correction
      type: text2text-generation
    dataset:
      name: CTFL-GEC
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: 0.47
    - name: Recall
      type: recall
      value: 0.47
    - name: F1
      type: f1
      value: 0.47
    - name: F0.5
      type: f0.5
      value: 0.47
    - name: BLEU
      type: bleu
      value: 0.69
    - name: GLEU
      type: gleu
      value: 0.68
    - name: CHRF
      type: chrf
      value: 0.87
language:
- th
---

# pakawadeep/ctfl-gec-th

This model is a fine-tuned version of [google/mt5-large](https://huggingface.co/google/mt5-large), trained for **Grammatical Error Correction (GEC)** in **Thai** for **L2 learners**. It was developed as part of the research *"Grammatical Error Correction for L2 Learners of Thai Using Large Language Models"* and represents the best-performing model in the study.

## Model description

This model is based on the mT5-large architecture and was fine-tuned on the CTFL-GEC dataset, which contains human-annotated grammatical error corrections from L2 Thai learners. To improve generalization, the dataset was augmented using the Self-Instruct method with 200% additional synthetic pairs.

The model corrects sentence-level grammatical errors typical of L2 Thai writing, including issues with word order, omissions, and incorrect particles.

## Intended uses & limitations

### Intended uses

- Grammatical error correction for Thai language learners
- Linguistic analysis of L2 learner errors
- Research in low-resource GEC methods

### Limitations

- May not generalize to informal or dialectal Thai
- Performance may degrade on sentence types or domains not represented in the training data
- Designed for Thai GEC only; not optimized for multilingual correction tasks

## Training and evaluation data

The model was fine-tuned on a combined dataset consisting of:

- **CTFL-GEC**: A manually annotated corpus of Thai learner writing (370 writing samples, 4,200+ sentences)
- **Self-Instruct augmentation (200%)**: Synthetic GEC pairs generated using LLM prompting

Evaluation was conducted on a held-out portion of the human-annotated dataset using common GEC metrics.

## Training procedure

### Training hyperparameters

- **Optimizer**: AdamWeightDecay
- **Learning rate**: 2e-5
- **Beta1/Beta2**: 0.9 / 0.999
- **Epsilon**: 1e-7
- **Weight decay**: 0.01
- **Precision**: float32

### Framework versions

- Transformers 4.41.2
- TensorFlow 2.15.0
- Datasets 2.20.0
- Tokenizers 0.19.1

## Citation

If you use this model, please cite the associated thesis:

```
Pakawadee P. Chookwan, "Grammatical Error Correction for L2 Learners of Thai Using Large Language Models", 2025.
```
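
## How to use

The snippet below is a minimal inference sketch, assuming the standard `transformers` sequence-to-sequence API with TensorFlow weights (consistent with the framework versions above). The example sentence, the generation settings, and the absence of a task prefix are illustrative assumptions, not details taken from the thesis:

```python
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint (TensorFlow, matching the framework versions above).
tokenizer = AutoTokenizer.from_pretrained("pakawadeep/ctfl-gec-th")
model = TFAutoModelForSeq2SeqLM.from_pretrained("pakawadeep/ctfl-gec-th")

# Placeholder learner sentence; if training used a task prefix, prepend it here.
text = "ฉันไปโรงเรียนเมื่อวานนี้"

inputs = tokenizer(text, return_tensors="tf")
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Beam search (`num_beams=4`) is a common default for GEC decoding; greedy decoding is faster if latency matters.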
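
## Optimizer sketch

For reference, the optimizer listed under "Training procedure" can be instantiated with the `AdamWeightDecay` class that ships with `transformers`. Only the values shown in that section are taken from this card; the learning-rate schedule, batch size, and epoch count are not documented here and are omitted:

```python
from transformers import AdamWeightDecay

# Hyperparameters as listed under "Training procedure"; a constant learning
# rate is assumed since this card does not specify a schedule.
optimizer = AdamWeightDecay(
    learning_rate=2e-5,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    weight_decay_rate=0.01,
)
```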
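
## Evaluation sketch

A rough sketch of how the surface metrics above (BLEU, chrF) can be computed with `sacrebleu`; this is not the evaluation script used in the study, and the hypothesis/reference pairs are placeholders. chrF operates on characters and needs no Thai word segmentation, while BLEU on unsegmented Thai requires an explicit tokenization choice (character-level is used here as one reasonable option). The precision/recall/F0.5 figures would instead come from an edit-based scorer such as ERRANT or the M² scorer, which is not shown:

```python
import sacrebleu

# Hypothetical sentence-aligned data: one corrected hypothesis per source
# sentence, and one reference stream covering all sentences.
hypotheses = ["ฉันไปโรงเรียนเมื่อวานนี้"]
references = [["ฉันไปโรงเรียนเมื่อวานนี้"]]

# chrF is character-based, so it handles unsegmented Thai directly.
chrf = sacrebleu.corpus_chrf(hypotheses, references)

# BLEU with character tokenization sidesteps Thai word-boundary ambiguity.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="char")

print(f"chrF: {chrf.score:.2f}  BLEU: {bleu.score:.2f}")
```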