---
license: mit
language:
- ja
metrics:
- f1
base_model:
- tohoku-nlp/bert-large-japanese-v2
pipeline_tag: text-classification
tags:
- japanese
- grammar
- classification
---
# Model Card for Japanese Grammar Point Classification
<!-- Provide a quick summary of what the model is/does. -->
This model is a fine-tuned version of tohoku-nlp/bert-large-japanese-v2 designed to perform multi-class classification of Japanese grammar points.
It was trained on labeled data sourced from the 日本語文型辞典 (grammar dictionary) and augmented with synthetic examples generated by a large language model.
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
This model takes a Japanese sentence as input and predicts the most likely grammar point(s) used in that sentence. It can be integrated into language-learning applications, grammar checkers, or reading-assistant tools.
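Below is a minimal inference sketch with the 🤗 Transformers pipeline. The repository id is a placeholder and the returned label names depend on this model's config; the Japanese tokenizer additionally requires the `fugashi` and `unidic-lite` packages.

```python
# Hedged sketch, not the card author's code. Requires: transformers, fugashi, unidic-lite.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="arvine111/japanese-grammar-classifier",  # hypothetical repo id; substitute the real one
    top_k=2,  # the card reports a Top-2 score, so inspecting the two best labels is natural
)

# Returns the two highest-scoring grammar-point labels with their scores.
print(classifier("雨が降らないかぎり、試合は中止になりません。"))
```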
### Out-of-Scope Use
- Machine translation or text generation tasks.
- Understanding semantics beyond grammar point identification.
## Fine-tuning Details
### Fine-tuning Data
- Source: 日本語文型辞典 (grammar dictionary), covering ~2,400 grammar points.
- Augmentation: synthetic sentences generated with a large language model to balance low-frequency grammar points (minimum 20 examples per point); a balancing-check sketch follows.
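As an illustration of that balancing step, the sketch below counts training examples per grammar point and flags any that still fall short of the 20-example floor; the JSONL file name and the `label` field are assumptions, not the actual data format.

```python
# Illustrative check for the 20-examples-per-point floor described above.
import json
from collections import Counter

with open("train.jsonl", encoding="utf-8") as f:  # assumed file name
    counts = Counter(json.loads(line)["label"] for line in f)  # assumed field name

under_min = {label: n for label, n in counts.items() if n < 20}
print(f"{len(counts)} grammar points, {len(under_min)} still below 20 examples")
```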
### Fine-tuning Procedure
- Preprocessing: Tokenization with MeCab + Unidic lite; WordPiece subword encoding.
- Batch size: 64
- Max sequence length: 128 tokens
- Optimizer: AdamW (learning rate = 3e-5, weight decay = 0.05)
- Scheduler: Linear warmup of 20% steps, then linear decay
- Epochs: 10
- Mixed precision: Enabled (fp16)
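The settings above map roughly onto the following Hugging Face Trainer configuration. This is a reconstruction under the listed hyperparameters, not the original training script; the output path and the `sentence` field name are illustrative.

```python
# Approximate reconstruction of the listed fine-tuning settings; not the original script.
from transformers import AutoTokenizer, TrainingArguments

# The base model's tokenizer performs MeCab (+ unidic-lite) word splitting
# followed by WordPiece subword encoding.
tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-large-japanese-v2")

def preprocess(example):
    # Truncate to the 128-token maximum sequence length used for fine-tuning.
    return tokenizer(example["sentence"], truncation=True, max_length=128)

training_args = TrainingArguments(
    output_dir="grammar-point-classifier",  # illustrative path
    per_device_train_batch_size=64,
    learning_rate=3e-5,
    weight_decay=0.05,
    warmup_ratio=0.2,              # linear warmup over 20% of steps...
    lr_scheduler_type="linear",    # ...then linear decay
    num_train_epochs=10,
    fp16=True,                     # mixed precision
)
```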
## Evaluation
- Test set: held-out sentences from the dictionary and synthetic data (10% of the total).
- Metrics:
  - F1 score (macro): 83.51%
  - Top-2 F1 score (macro): 94.96%
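
The macro F1 is the standard scikit-learn score. The sketch below reads "Top-2 F1" as counting a prediction correct when the true label appears among the model's two highest-scoring classes; that reading is an assumption, not something the card states.

```python
# Hedged sketch of the evaluation metrics; the Top-2 interpretation is an assumption.
import numpy as np
from sklearn.metrics import f1_score

def top2_predictions(logits: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """Keep the argmax prediction unless the true label is in the top 2, then credit it."""
    top2 = np.argsort(logits, axis=-1)[:, -2:]
    pred = logits.argmax(axis=-1)
    hit = (top2 == y_true[:, None]).any(axis=1)
    return np.where(hit, y_true, pred)

# macro_f1      = f1_score(y_true, logits.argmax(-1), average="macro")
# top2_macro_f1 = f1_score(y_true, top2_predictions(logits, y_true), average="macro")
```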