---
license: mit
language:
- ja
metrics:
- f1
base_model:
- tohoku-nlp/bert-large-japanese-v2
pipeline_tag: text-classification
tags:
- japanese
- grammar
- classification
---

# Model Card for Japanese Grammar Point Classifier
This model is a fine-tuned version of tohoku-nlp/bert-large-japanese-v2 that performs multi-class classification of Japanese grammar points. It was trained on labeled data from the 日本語文型辞典 (a dictionary of Japanese grammar patterns), augmented with synthetic examples generated by a large language model.

## Uses
### Direct Use

This model takes a Japanese sentence as input and predicts the most likely grammar point(s) used in that sentence. It can be integrated into language-learning applications, grammar checkers, or reading-assistant tools.
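
For example, a minimal inference sketch using the transformers pipeline API (the model ID below is a placeholder for wherever this checkpoint is hosted):

```python
from transformers import pipeline

# Placeholder model ID; substitute the actual repository for this checkpoint.
classifier = pipeline(
    "text-classification",
    model="your-username/japanese-grammar-classifier",
)

# top_k=2 returns the two most likely grammar points, matching the
# Top-2 metric reported under Evaluation below.
print(classifier("日本に行けばよかった。", top_k=2))
```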

### Out-of-Scope Use

- Machine translation or text generation tasks.
- Understanding semantics beyond grammar-point identification.

## Fine-tuning Details

### Fine-tuning Data

- Source: the 日本語文型辞典 (a dictionary of Japanese grammar patterns), covering roughly 2,400 grammar points.
- Augmentation: synthetic sentences generated with a large language model to balance low-frequency grammar points (a minimum of 20 examples per point), as sketched below.
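
The balancing step can be pictured as a simple per-label count against the 20-example floor. The sketch below is an illustration only; the tuple structure is an assumption, not the card's actual pipeline:

```python
from collections import Counter

MIN_EXAMPLES = 20  # minimum examples per grammar point, per the card

# Hypothetical structure: (sentence, grammar_point_label) pairs.
examples = [
    ("日本に行けばよかった。", "〜ばよかった"),
    ("雨が降りそうだ。", "〜そうだ"),
]

counts = Counter(label for _, label in examples)
# Map each low-frequency grammar point to the number of synthetic
# sentences an LLM would be prompted to generate for it.
shortfall = {lbl: MIN_EXAMPLES - n for lbl, n in counts.items() if n < MIN_EXAMPLES}
```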

### Fine-tuning Procedure

- Preprocessing: tokenization with MeCab (unidic-lite dictionary), followed by WordPiece subword encoding.
- Batch size: 64
- Max sequence length: 128 tokens
- Optimizer: AdamW (learning rate = 3e-5, weight decay = 0.05)
- Scheduler: linear warmup over the first 20% of steps, then linear decay
- Epochs: 10
- Mixed precision: enabled (fp16)
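
These hyperparameters map directly onto a standard transformers Trainer setup. The following is a minimal sketch under that assumption, not the exact training script; `train_dataset` and `eval_dataset` stand in for data already tokenized to max_length=128 with integer grammar-point labels:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

NUM_LABELS = 2400  # approximate number of grammar points covered

tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-large-japanese-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "tohoku-nlp/bert-large-japanese-v2", num_labels=NUM_LABELS
)

args = TrainingArguments(
    output_dir="grammar-classifier",
    per_device_train_batch_size=64,
    learning_rate=3e-5,
    weight_decay=0.05,
    warmup_ratio=0.2,            # linear warmup over the first 20% of steps
    lr_scheduler_type="linear",  # then linear decay
    num_train_epochs=10,
    fp16=True,                   # mixed precision
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: pre-tokenized HF datasets
    eval_dataset=eval_dataset,
)
trainer.train()
```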

## Evaluation

- Test set: held-out sentences from the dictionary and synthetic data (10% of the total).
- Metrics:
  - F1 score (macro): 83.51%
  - Top-2 F1 score (macro): 94.96%
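
Top-2 F1 here is read as: a prediction counts as correct when the true label is among the model's two highest-scoring classes. One common way to compute such a score, assuming logits and integer labels as NumPy arrays (an illustration, not the card's evaluation script):

```python
import numpy as np
from sklearn.metrics import f1_score

def top2_macro_f1(logits: np.ndarray, labels: np.ndarray) -> float:
    """Macro F1 that accepts the gold label if it is in the top-2 predictions."""
    top2 = np.argsort(logits, axis=-1)[:, -2:]        # two highest-scoring classes
    in_top2 = (top2 == labels[:, None]).any(axis=-1)  # gold label among them?
    # Credit the gold label when it made the top 2; otherwise fall back
    # to the argmax prediction so errors still count against the score.
    preds = np.where(in_top2, labels, logits.argmax(axis=-1))
    return float(f1_score(labels, preds, average="macro"))
```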