Model Card for prompt-parsing-v0-gemma-2-9b-lora

megagonlabs/prompt-based-parsing-gemma-2-9b-lora-v1 is a dependency parsing model that analyzes, step by step, a gold token sequence given in the user prompt. The model was trained on Universal Dependencies datasets for 7 languages and achieves SoTA-level accuracy on UPOS, UAS, and LAS.

megagonlabs/prompt-based-parsing-gemma-2-9b-lora-v1はユーザプロンプトで与えられた正解トークン列に対してstep-by-stepで依存構造解析を行うモデルです。このモデルはUniversal Dependenciesの7つの言語のデータセットを用いて訓練されており、UPOS, UAS, LASにおいてSoTAレベルの解析精度を持ちます。

Terms of Use

This LoRA adapter package is released under the CC BY-SA 4.0 license.
However, please note the following important conditions regarding its usage:

This package does not contain any part of the original Gemma 2 model.
To use this package, you must obtain and use the base model distributed by Google: Gemma 2 9B base on Hugging Face
Use of the Gemma models requires agreement to the Gemma Terms of Use.

利用規約 (Japanese version of the Terms of Use)

このLoRAアダプタパッケージは、CC BY-SA 4.0に基づいてリリースされています。
ただし、使用に関しては以下の重要な利用条件に注意してください。

このパッケージにはオリジナルのGemma 2モデルは含まれていません
このパッケージを使用するには、Googleが配布するGemmaモデルを入手して使用する必要があります: Gemma 2 9B base on Hugging Face
Gemmaモデルの使用にはGemma Terms of Useへの同意が必要です

Usage

Install

# for CUDA 12.1
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install vllm==0.7.2 sudachipy sudachidict-core

In this first release, we provide a code example only for the sudachipy tokenizer, whose output matches the token boundaries of the UD Japanese datasets. Code examples for other languages will be provided in upcoming releases.
本リリースでは、UD Japanese データセットのトークン境界との親和性の高いsudachipyをトークナイザーに使用したサンプルコードのみを提供します。他の言語向けのサンプルコードは、今後のリリースで提供予定です。

Code example

import json
import sudachipy
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

base_model = "google/gemma-2-9b"
adapter_model = "megagonlabs/prompt-based-parsing-gemma-2-9b-lora-v1"
input_language = "Japanese"
input_sentences = ["銀座でランチをご一緒しましょう。", "この時代から、日本列島に人類が住んだ遺跡や遺物が多く発見されている。"]

tokenizer = sudachipy.Dictionary().create(mode=sudachipy.Tokenizer.SplitMode.A)

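def tokenize_japanese_space_after(sentence: str) -> list[str]:
    """Split a sentence into surface tokens; a trailing space on a token marks "space after"."""
    tokens = []
    for m in tokenizer.tokenize(sentence):
        surface = m.surface()
        if surface in [" ", "　"]:
            # Fold a half-width or full-width space into the preceding token.
            if tokens and tokens[-1][-1] != " ":
                tokens[-1] += " "
        else:
            tokens.append(surface)
    # The sentence-final token always carries a trailing space.
    if tokens and tokens[-1][-1] != " ":
        tokens[-1] += " "
    return tokens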

def apply_template(language: str, sentence: str, tokens: list[str]) -> str:
    return """You are an <<<LANGUAGE>>> linguist and specialize in <<<LANGUAGE>>> dependency analysis based on Universal Dependencies.
We will now perform dependency parsing on <<<LANGUAGE>>> sentence.
After splitting the input sentence into words as shown below, execute following three tasks:

- Task 1

Create a TSV with three fields: word index from 1 to <<<TOKEN_NUM>>> + word + part of speech.

- Task 2
Add a field for the dependent word indexes to each row to the output of Task 1.
However, for the word that is the main predicate of the sentence, the dependent word index should be 0.

- Task 3

Add a field for the Universal Dependencies relation labels to the output of Task 2.


input sentence:
<<<SENTENCE>>>

words:
<<<TOKENS>>>
""".replace("<<<LANGUAGE>>>", language).replace("<<<TOKEN_NUM>>>", str(len(tokens))).replace("<<<SENTENCE>>>", sentence).replace("<<<TOKENS>>>", "\n".join(tokens))

input_prompts = [
    [
        {
            "role": "user",
            "content": apply_template(input_language, s, tokenize_japanese_space_after(s)),
        }
    ] for s in input_sentences
]

llm = LLM(
    model=base_model,
    enable_lora=True,
    tokenizer=adapter_model,
    dtype="bfloat16",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    enforce_eager=True,
)
sampling_params = SamplingParams(
    temperature=0.,
    max_tokens=1024,  # <= 8192
)
lora_request = LoRARequest("adapter", 1, adapter_model)

results = llm.chat(
    messages=input_prompts,
    sampling_params=sampling_params,
    use_tqdm=False,
    lora_request=lora_request,
)
for sentence, result in zip(input_sentences, results):
    print("# text =", sentence)
    print(result.outputs[0].text)

Outputs of Code example

# text = 銀座でランチをご一緒しましょう。
- Task 1

1	銀座	PROPN
2	で	ADP
3	ランチ	NOUN
4	を	ADP
5	ご	NOUN
6	一緒	NOUN
7	し	AUX
8	ましょう	AUX
9	。 	PUNCT

- Task 2

1	銀座	PROPN	6
2	で	ADP	1
3	ランチ	NOUN	6
4	を	ADP	3
5	ご	NOUN	6
6	一緒	NOUN	0
7	し	AUX	6
8	ましょう	AUX	6
9	。 	PUNCT	6

- Task 3

1	銀座	PROPN	6	nmod
2	で	ADP	1	case
3	ランチ	NOUN	6	obj
4	を	ADP	3	case
5	ご	NOUN	6	compound
6	一緒	NOUN	0	root
7	し	AUX	6	aux
8	ましょう	AUX	6	aux
9	。 	PUNCT	6	punct


# text = この時代から、日本列島に人類が住んだ遺跡や遺物が多く発見されている。
- Task 1

1	この	DET
2	時代	NOUN
3	から	ADP
4	、	PUNCT
5	日本	PROPN
6	列島	NOUN
7	に	ADP
8	人類	NOUN
9	が	ADP
10	住ん	VERB
11	だ	AUX
12	遺跡	NOUN
13	や	ADP
14	遺物	NOUN
15	が	ADP
16	多く	ADJ
17	発見	VERB
18	さ	AUX
19	れ	AUX
20	て	SCONJ
21	いる	VERB
22	。 	PUNCT

- Task 2

1	この	DET	2
2	時代	NOUN	17
3	から	ADP	2
4	、	PUNCT	2
5	日本	PROPN	6
6	列島	NOUN	10
7	に	ADP	6
8	人類	NOUN	10
9	が	ADP	8
10	住ん	VERB	12
11	だ	AUX	10
12	遺跡	NOUN	14
13	や	ADP	12
14	遺物	NOUN	17
15	が	ADP	14
16	多く	ADJ	17
17	発見	VERB	0
18	さ	AUX	17
19	れ	AUX	17
20	て	SCONJ	17
21	いる	VERB	20
22	。 	PUNCT	17

- Task 3

1	この	DET	2	det
2	時代	NOUN	17	obl
3	から	ADP	2	case
4	、	PUNCT	2	punct
5	日本	PROPN	6	compound
6	列島	NOUN	10	obl
7	に	ADP	6	case
8	人類	NOUN	10	nsubj
9	が	ADP	8	case
10	住ん	VERB	12	acl
11	だ	AUX	10	aux
12	遺跡	NOUN	14	nmod
13	や	ADP	12	case
14	遺物	NOUN	17	nsubj
15	が	ADP	14	case
16	多く	ADJ	17	advcl
17	発見	VERB	0	root
18	さ	AUX	17	aux
19	れ	AUX	17	aux
20	て	SCONJ	17	mark
21	いる	VERB	20	fixed
22	。 	PUNCT	17	punct
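
The Task 3 section of each response holds the final analysis, so downstream code usually needs only that part. The following is a minimal sketch of one way to extract it and convert it to CoNLL-U lines. The function names `parse_task3` and `to_conllu` are hypothetical, and this is not necessarily the recovery process used for the evaluation in this card. It assumes that a trailing space on a word marks "space after", as in the tokenizer of the code example.

```python
def parse_task3(output_text: str) -> list[dict]:
    """Extract the Task 3 rows (index, word, UPOS, head, relation) from one response."""
    # Keep only the text that follows the last "- Task 3" marker.
    _, marker, task3 = output_text.rpartition("- Task 3")
    if not marker:
        raise ValueError("no '- Task 3' section found")
    rows = []
    for line in task3.splitlines():
        fields = line.split("\t")
        # Skip blank lines and rows that do not have the expected five fields.
        if len(fields) != 5 or not fields[0].isdigit() or not fields[3].isdigit():
            continue
        index, form, upos, head, deprel = fields
        rows.append(
            {
                "id": int(index),
                "form": form.rstrip(" "),
                "space_after": form.endswith(" "),
                "upos": upos,
                "head": int(head),
                "deprel": deprel,
            }
        )
    return rows


def to_conllu(rows: list[dict]) -> str:
    """Render rows as 10-column CoNLL-U lines; unknown columns are filled with "_"."""
    lines = []
    for r in rows:
        misc = "_" if r["space_after"] else "SpaceAfter=No"
        columns = [
            str(r["id"]), r["form"], "_", r["upos"], "_", "_",
            str(r["head"]), r["deprel"], "_", misc,
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines)


if __name__ == "__main__":
    # A shortened, hand-written response used only to exercise the functions.
    sample = (
        "- Task 1\n\n1\t銀座\tPROPN\n2\t。 \tPUNCT\n\n"
        "- Task 3\n\n1\t銀座\tPROPN\t0\troot\n2\t。 \tPUNCT\t1\tpunct\n"
    )
    print(to_conllu(parse_task3(sample)))
```

In the code example above, `parse_task3(result.outputs[0].text)` would be the natural call site. Rows that fail the field checks are silently dropped here; a stricter caller may prefer to raise instead.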

Training and Evaluation

Training Data and Hyper-parameters

We used the training sets of the UD datasets below for LoRA SFT.
本モデルのLoRA SFTには次のUDデータセットのtrainセットを使用しました。

We also used the training hyper-parameters below:
また訓練時には次のハイパーパラメータを使用しています。

lr: 5e-5
num_train_epochs: 2
lora_target_modules: "all-linear"
lora_r: 8
lora_alpha: 8
lora_dropout: 0.05

The details of the experimental conditions will be released later.
実験条件の詳細については後日公開予定です。
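
As a rough illustration, the hyper-parameters above could be expressed with the PEFT and TRL configuration classes as shown below. This is a sketch under assumptions, not the training script used for this model: `task_type` and `output_dir` are not stated in this card and are placeholders, and every setting not listed above is left at the library default.

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA settings taken from the list above.
peft_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",  # assumption: not stated in this card
)

# Optimizer and schedule settings taken from the list above.
training_args = SFTConfig(
    output_dir="outputs/prompt-based-parsing-lora",  # hypothetical path
    learning_rate=5e-5,
    num_train_epochs=2,
)
```

Both objects would typically be passed to `trl.SFTTrainer` through its `peft_config` and `args` parameters.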

Evaluation Results

The accuracies in the table below were measured on the test sets of the UD datasets for the seven languages mentioned above. The gold tokens of each test sentence were given as input, and a simple recovery process was applied to the TSV output of Task 3 before scoring.
次の表に記載した精度は、前述の7言語のUDデータセットのtestセットの正解トークンを用いて、Task 3のTSV出力に簡易なリカバリ処理を適用した上で評価を行っています。

dataset	UPOS	UAS	LAS
UD_English-EWT	0.982	0.951	0.937
UD_Japanese-GSD	0.987	0.952	0.939
UD_Chinese-GSDSimp	0.972	0.889	0.862
UD_Korean-GSD	0.970	0.898	0.868
UD_French-GSD	0.981	0.956	0.943
UD_German-GSD	0.974	0.908	0.873
UD_Slovenian-SSJ	0.989	0.954	0.939
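
For readers unfamiliar with the three metrics: UPOS is the fraction of tokens with the correct part-of-speech tag, UAS the fraction with the correct head, and LAS the fraction with both the correct head and the correct relation label. The sketch below computes them over token-aligned gold and predicted rows. The function name `attachment_scores` and the dictionary keys are hypothetical, labels are compared as-is, and the evaluation script actually used for the table may differ in details.

```python
def attachment_scores(gold: list[dict], pred: list[dict]) -> dict[str, float]:
    """Compute UPOS accuracy, UAS, and LAS over token-aligned rows.

    Each row is a dict with the keys "upos", "head", and "deprel".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted rows must be aligned token by token")
    if not gold:
        raise ValueError("no tokens to score")
    total = len(gold)
    upos_ok = sum(g["upos"] == p["upos"] for g, p in zip(gold, pred))
    head_ok = sum(g["head"] == p["head"] for g, p in zip(gold, pred))
    # LAS requires both the head and the relation label to match.
    both_ok = sum(
        g["head"] == p["head"] and g["deprel"] == p["deprel"]
        for g, p in zip(gold, pred)
    )
    return {"UPOS": upos_ok / total, "UAS": head_ok / total, "LAS": both_ok / total}


if __name__ == "__main__":
    # Hand-written rows: the prediction mislabels the relation of the first token.
    gold = [
        {"upos": "PROPN", "head": 2, "deprel": "nmod"},
        {"upos": "NOUN", "head": 0, "deprel": "root"},
    ]
    pred = [
        {"upos": "PROPN", "head": 2, "deprel": "obl"},
        {"upos": "NOUN", "head": 0, "deprel": "root"},
    ]
    print(attachment_scores(gold, pred))
```

Because gold tokens are supplied in the prompt, gold and predicted rows align one to one, which is why the sketch does not need a token alignment step.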

Framework versions

TRL v0.15.2 (for training)
PEFT v0.14.0 (for training)
vLLM 0.7.2 (for inference)

Citation

@article{matsuda-nl263,
  title={大規模言語モデルによる対話型依存構造解析},
  author={松田寛},
  journal={研究報告自然言語処理 (NL)},
  volume={2025},
  number={17},
  pages={1--7},
  year={2025},
  publisher={情報処理学会}
}
