---
license: apache-2.0
base_model: google/mt5-large
tags:
- thai
- grammatical-error-correction
- mt5
- fine-tuned
- l2-learners
- generated_from_keras_callback
model-index:
- name: pakawadeep/ctfl-gec-th
  results:
  - task:
      name: Grammatical Error Correction
      type: text2text-generation
    dataset:
      name: CTFL-GEC
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: 0.47
    - name: Recall
      type: recall
      value: 0.47
    - name: F1
      type: f1
      value: 0.47
    - name: F0.5
      type: f0.5
      value: 0.47
    - name: BLEU
      type: bleu
      value: 0.69
    - name: GLEU
      type: gleu
      value: 0.68
    - name: CHRF
      type: chrf
      value: 0.87
language:
- th
---
# pakawadeep/ctfl-gec-th
This model is a fine-tuned version of [google/mt5-large](https://huggingface.co/google/mt5-large), trained for **Grammatical Error Correction (GEC)** of **Thai** written by **L2 learners**. It was developed as part of the thesis *"Grammatical Error Correction for L2 Learners of Thai Using Large Language Models"* and is the best-performing model reported in that work.
## Model description
This model is based on the mT5-large architecture and was fine-tuned on the CTFL-GEC dataset, which contains human-annotated grammatical error corrections of L2 Thai learner writing. To improve generalization, the training data was augmented with the Self-Instruct method, adding 200% more synthetic error-correction pairs.
The model corrects sentence-level grammatical errors typical of L2 Thai writing, including word-order problems, omissions, and incorrect particles.
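A minimal inference sketch is shown below. It assumes the checkpoint loads through the standard `transformers` seq2seq API and that no task prefix is required (neither detail is documented here); the Thai input sentence is purely illustrative.

```python
# Minimal inference sketch. Assumptions: standard seq2seq loading works for
# this checkpoint and no task prefix is needed -- neither is documented here.
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("pakawadeep/ctfl-gec-th")
model = TFAutoModelForSeq2SeqLM.from_pretrained("pakawadeep/ctfl-gec-th")

# Illustrative learner sentence; replace with real L2 Thai input.
inputs = tokenizer("ฉันไปโรงเรียนเมื่อวาน", return_tensors="tf")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```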
## Intended uses & limitations
### Intended uses
- Grammatical error correction for Thai language learners
- Linguistic analysis of L2 learner errors
- Research in low-resource GEC methods
### Limitations
- May not generalize to informal or dialectal Thai
- Performance may degrade on sentence types or domains not represented in the training data
- Designed for Thai GEC only; not optimized for multilingual correction tasks
## Training and evaluation data
The model was fine-tuned on a combined dataset consisting of:
- **CTFL-GEC**: A manually annotated corpus of Thai learner writing (370 writing samples, 4,200+ sentences)
- **Self-Instruct augmentation (200%)**: Synthetic GEC pairs generated using LLM prompting
Evaluation was conducted on a held-out portion of the human-annotated dataset using standard GEC metrics: precision, recall, F1, F0.5, BLEU, GLEU, and chrF (values are listed in the model-index above; see also the sketch below).
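As a reminder of how the precision-weighted score relates to the other two, the snippet below computes F-beta from precision and recall; with both reported at 0.47, F0.5 works out to 0.47 as listed above.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall,
    the usual choice in GEC evaluation."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.47, 0.47), 2))  # 0.47, matching the reported F0.5
```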
## Training procedure
### Training hyperparameters
- **Optimizer**: AdamWeightDecay
- **Learning rate**: 2e-5
- **Beta1/Beta2**: 0.9 / 0.999
- **Epsilon**: 1e-7
- **Weight decay**: 0.01
- **Precision**: float32
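
The hyperparameters above map directly onto the `AdamWeightDecay` optimizer shipped with the TensorFlow utilities in `transformers`; a minimal sketch of that configuration (not the verbatim training script) is:

```python
# Sketch of an optimizer matching the listed hyperparameters; the actual
# training script may differ (e.g., in learning-rate scheduling).
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(
    learning_rate=2e-5,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    weight_decay_rate=0.01,
)
```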
### Framework versions
- Transformers 4.41.2
- TensorFlow 2.15.0
- Datasets 2.20.0
- Tokenizers 0.19.1
## Citation
If you use this model, please cite the associated thesis:
```
Pakawadee P. Chookwan, "Grammatical Error Correction for L2 Learners of Thai Using Large Language Models", 2025.
```