---
license: mit
language:
- ja
metrics:
- f1
base_model:
- tohoku-nlp/bert-large-japanese-v2
pipeline_tag: text-classification
tags:
- japanese
- grammar
- classification
---

# Model Card for Japanese Grammar Point Classifier

This model is a fine-tuned version of tohoku-nlp/bert-large-japanese-v2 that performs multi-class classification of Japanese grammar points. It was trained on labeled data sourced from the 日本語文型辞典 (a dictionary of Japanese sentence patterns) and augmented with synthetic examples generated by a large language model.

## Uses

### Direct Use

The model takes a Japanese sentence as input and predicts the most likely grammar point(s) used in that sentence. It can be integrated into language-learning applications, grammar checkers, or reading-assistant tools. A minimal inference sketch is given at the end of this card.

### Out-of-Scope Use

- Machine translation or text generation tasks.
- Semantic understanding beyond grammar point identification.

## Fine-tuning Details

### Fine-tuning Data

- Source: 日本語文型辞典, covering roughly 2,400 grammar points.
- Augmentation: synthetic sentences generated with a large language model to balance low-frequency grammar points (minimum 20 examples per point).

### Fine-tuning Procedure

- Preprocessing: tokenization with MeCab + UniDic Lite, followed by WordPiece subword encoding.
- Batch size: 64
- Max sequence length: 128 tokens
- Optimizer: AdamW (learning rate = 3e-5, weight decay = 0.05)
- Scheduler: linear warmup over the first 20% of steps, then linear decay
- Epochs: 10
- Mixed precision: enabled (fp16)

A configuration sketch using the Hugging Face Trainer appears at the end of this card.

## Evaluation

- Test set: held-out sentences from the dictionary and synthetic data (10% of the total).
- Metrics:
  - F1 score (macro): 83.51%
  - Top-2 F1 score (macro): 94.96%

A sketch of how these metrics can be computed is given below.
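
## How to Get Started with the Model

A minimal inference sketch, assuming the model is loaded through the Hugging Face `transformers` library. The repository ID below is a placeholder, not the actual Hub ID; the Japanese tokenizer additionally requires the `fugashi` and `unidic-lite` packages.

```python
# Minimal inference sketch. The model ID is a placeholder; replace it with
# the actual repository ID on the Hub.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-username/bert-large-japanese-grammar"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

sentence = "雨が降らないかぎり、試合は行われます。"
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits

# Report the two highest-scoring grammar points (mirrors the Top-2 metric above).
probs = logits.softmax(dim=-1)[0]
top2 = torch.topk(probs, k=2)
for score, idx in zip(top2.values.tolist(), top2.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")
```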
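
## Fine-tuning Configuration Sketch

A sketch of the hyperparameters listed above, expressed as Hugging Face `TrainingArguments`. Dataset loading and the label mapping are omitted, and `num_labels` is illustrative (roughly one label per grammar point); this is not the exact training script.

```python
# Fine-tuning configuration sketch based on the procedure described above.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_model = "tohoku-nlp/bert-large-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(base_model)  # sentences are tokenized with max_length=128
model = AutoModelForSequenceClassification.from_pretrained(
    base_model, num_labels=2400  # illustrative; one label per grammar point
)

args = TrainingArguments(
    output_dir="bert-japanese-grammar",
    per_device_train_batch_size=64,
    learning_rate=3e-5,
    weight_decay=0.05,
    warmup_ratio=0.2,            # linear warmup over the first 20% of steps
    lr_scheduler_type="linear",  # then linear decay
    num_train_epochs=10,
    fp16=True,                   # mixed precision
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```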
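
## Metric Computation Sketch

A sketch of how the reported metrics can be computed from model logits, using scikit-learn. The top-2 variant here counts a prediction as correct when the true label appears among the two highest-scoring classes; that definition is an assumption about how the Top-2 macro F1 above was derived.

```python
import numpy as np
from sklearn.metrics import f1_score

def macro_f1(logits: np.ndarray, labels: np.ndarray) -> float:
    # Standard macro-averaged F1 over the argmax predictions.
    preds = logits.argmax(axis=-1)
    return f1_score(labels, preds, average="macro")

def top2_macro_f1(logits: np.ndarray, labels: np.ndarray) -> float:
    # Assumed definition: credit the true label when it is among the two
    # highest-scoring classes, otherwise fall back to the top-1 prediction.
    top2 = np.argsort(logits, axis=-1)[:, -2:]
    hit = (top2 == labels[:, None]).any(axis=-1)
    preds = np.where(hit, labels, logits.argmax(axis=-1))
    return f1_score(labels, preds, average="macro")
```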