|
---
license: apache-2.0
base_model: google/mt5-large
tags:
- thai
- grammatical-error-correction
- mt5
- fine-tuned
- l2-learners
- generated_from_keras_callback
model-index:
- name: pakawadeep/ctfl-gec-th
  results:
  - task:
      name: Grammatical Error Correction
      type: text2text-generation
    dataset:
      name: CTFL-GEC
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: 0.47
    - name: Recall
      type: recall
      value: 0.47
    - name: F1
      type: f1
      value: 0.47
    - name: F0.5
      type: f0.5
      value: 0.47
    - name: BLEU
      type: bleu
      value: 0.69
    - name: GLEU
      type: gleu
      value: 0.68
    - name: CHRF
      type: chrf
      value: 0.87
language:
- th
---
|
|
|
# pakawadeep/ctfl-gec-th |
|
|
|
This model is a fine-tuned version of [google/mt5-large](https://huggingface.co/google/mt5-large) for **Grammatical Error Correction (GEC)** of **Thai** written by **L2 learners**. It was developed as part of the thesis *"Grammatical Error Correction for L2 Learners of Thai Using Large Language Models"* and is the best-performing model in that study.
|
|
|
## Model description |
|
|
|
This model is based on the mT5-large architecture and was fine-tuned on the CTFL-GEC dataset, which contains human-annotated grammatical error corrections of writing by L2 learners of Thai. To improve generalization, the training data was augmented with the Self-Instruct method, adding synthetic correction pairs equal to 200% of the original data.
|
|
|
The model is capable of correcting sentence-level grammatical errors typical of L2 Thai writing, including issues with word order, omissions, and incorrect particles. |
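A minimal usage sketch with 🤗 Transformers is shown below. The TensorFlow classes match the framework listed under *Framework versions*; the example sentence, beam size, and maximum length are illustrative assumptions rather than settings from the original study.

```python
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

model_id = "pakawadeep/ctfl-gec-th"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_id)

# Illustrative learner sentence with a word-order error
# ("ไทยประเทศ" instead of "ประเทศไทย").
text = "ฉันเกิดที่ไทยประเทศ"

inputs = tokenizer(text, return_tensors="tf")
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```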
|
|
|
## Intended uses & limitations |
|
|
|
### Intended uses |
|
- Grammatical error correction for Thai language learners |
|
- Linguistic analysis of L2 learner errors |
|
- Research in low-resource GEC methods |
|
|
|
### Limitations |
|
- May not generalize to informal or dialectal Thai |
|
- Performance may degrade on sentence types or domains not represented in the training data |
|
- Designed for Thai GEC only; not optimized for multilingual correction tasks |
|
|
|
## Training and evaluation data |
|
|
|
The model was fine-tuned on a combined dataset consisting of: |
|
- **CTFL-GEC**: A manually annotated corpus of Thai learner writing (370 writing samples, 4,200+ sentences) |
|
- **Self-Instruct augmentation (200%)**: Synthetic GEC pairs generated using LLM prompting |
|
|
|
Evaluation was conducted on a held-out portion of the human-annotated dataset using standard GEC metrics (precision, recall, F1, F0.5, BLEU, GLEU, and chrF).
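The exact evaluation scripts are not included in this card, but corpus-level BLEU and chrF scores of the kind reported above can be computed with the `sacrebleu` library. The sentences below are illustrative placeholders, and the study's own tokenization and scoring setup may differ.

```python
import sacrebleu

# Illustrative placeholders: system corrections and gold references.
hypotheses = ["ฉันเกิดที่ประเทศไทย"]
references = [["ฉันเกิดที่ประเทศไทย"]]  # one reference stream, one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```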
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
- **Optimizer**: AdamWeightDecay |
|
- **Learning rate**: 2e-5 |
|
- **Beta1/Beta2**: 0.9 / 0.999 |
|
- **Epsilon**: 1e-7 |
|
- **Weight decay**: 0.01 |
|
- **Precision**: float32 |
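A minimal sketch of how an equivalent optimizer could be constructed with the Keras utilities in Transformers, assuming only the hyperparameters listed above; the commented compile step is shown for context and is not part of this card's original training code.

```python
from transformers import AdamWeightDecay

# Mirrors the hyperparameters listed above (values taken from this card).
optimizer = AdamWeightDecay(
    learning_rate=2e-5,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    weight_decay_rate=0.01,
)

# model.compile(optimizer=optimizer)  # e.g. with a TFMT5ForConditionalGeneration model
```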
|
|
|
### Framework versions |
|
- Transformers 4.41.2 |
|
- TensorFlow 2.15.0 |
|
- Datasets 2.20.0 |
|
- Tokenizers 0.19.1 |
|
|
|
## Citation |
|
|
|
If you use this model, please cite the associated thesis: |
|
|
|
```
Pakawadee P. Chookwan, "Grammatical Error Correction for L2 Learners of Thai Using Large Language Models", 2025.
```
|
|