Update README.md
README.md CHANGED

@@ -14,14 +14,15 @@ metrics:
- accuracy
mask_token: "[MASK]"
widget:
- text: "京都 大学 で 自然 言語 処理 を [MASK] する 。"
---

# Model Card for Japanese DeBERTa V2 large

## Model description

This is a Japanese DeBERTa V2 large model pre-trained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.

## How to use

@@ -29,6 +30,7 @@ You can use this model for masked language modeling as follows:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-large-japanese')
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v2-large-japanese')
```
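
The loaded tokenizer and model can then be used for mask filling. The following is a minimal sketch rather than the card's own example: it reuses the pre-segmented widget sentence, since inputs must already be split into words by Juman++ (see Tokenization below).

```python
import torch

# Input must already be segmented into words by Juman++ (whitespace-separated).
sentence = '京都 大学 で 自然 言語 処理 を [MASK] する 。'
encoding = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    logits = model(**encoding).logits

# Report the top-5 candidate tokens at the [MASK] position.
mask_position = (encoding.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]
top5 = logits[0, mask_position].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))
```

A `fill-mask` pipeline works as well, provided the input text is segmented in the same way.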

@@ -41,7 +43,9 @@ You can also fine-tune this model on downstream tasks.

## Tokenization

The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece).
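
One way to produce that segmentation programmatically is sketched below; it assumes the [pyknp](https://github.com/ku-nlp/pyknp) wrapper and a locally installed `jumanpp` binary, neither of which the card prescribes.

```python
from pyknp import Juman  # pip install pyknp; requires a local Juman++ installation

jumanpp = Juman()  # assumes the `jumanpp` command is on PATH

def segment(text: str) -> str:
    """Return the text as whitespace-separated Juman++ words."""
    return ' '.join(m.midasi for m in jumanpp.analysis(text).mrph_list())

print(segment('京都大学で自然言語処理を研究する。'))
# e.g. '京都 大学 で 自然 言語 処理 を 研究 する 。'
```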

## Training data

@@ -52,14 +56,17 @@ We used the following corpora for pre-training:
- Japanese portion of OSCAR (54GB, 326M sentences, 25M documents)

Note that we filtered out documents annotated with "header", "footer", or "noisy" tags in OSCAR.
Also note that Japanese Wikipedia was duplicated 10 times to make the total size of the corpus comparable to that of CC-100 and OSCAR. As a result, the total size of the training data is 171GB.

## Training procedure

We first segmented texts in the corpora into words using [Juman++](https://github.com/ku-nlp/jumanpp).
Then, we built a sentencepiece model with 32000 tokens including words ([JumanDIC](https://github.com/ku-nlp/JumanDIC)) and subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).
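
For illustration only, building such a unigram sentencepiece model might look roughly like the sketch below; the file names and the use of `user_defined_symbols` to inject JumanDIC words are assumptions, not details taken from the card.

```python
import sentencepiece as spm

# Hypothetical inputs: a Juman++-segmented corpus and a word list extracted from JumanDIC.
with open('jumandic_words.txt') as f:
    jumandic_words = [line.strip() for line in f]

spm.SentencePieceTrainer.train(
    input='segmented_corpus.txt',         # one Juman++-segmented sentence per line
    model_prefix='deberta_v2_japanese',
    vocab_size=32000,                     # the 32000-token vocabulary mentioned above
    model_type='unigram',                 # subwords induced by the unigram language model
    user_defined_symbols=jumandic_words,  # assumption: one way to force whole words into the vocabulary
)

# Applying the trained model to a segmented sentence then yields its subwords.
sp = spm.SentencePieceProcessor(model_file='deberta_v2_japanese.model')
print(sp.encode('京都 大学 で 自然 言語 処理 を 研究 する 。', out_type=str))
```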

We tokenized the segmented corpora into subwords using the sentencepiece model and trained the Japanese DeBERTa model using the [transformers](https://github.com/huggingface/transformers) library.
The training took 36 days using 8 NVIDIA A100-SXM4-40GB GPUs.

The following hyperparameters were used during pre-training:

@@ -82,18 +89,23 @@ The evaluation set consists of 5,000 randomly sampled documents from each of the

## Fine-tuning on NLU tasks

We fine-tuned the following models and evaluated them on the dev set of JGLUE.
We tuned the learning rate and the number of training epochs for each model and task following [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).

| Model                         | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
|-------------------------------|-------------|--------------|---------------|----------|-----------|-----------|------------|
| Waseda RoBERTa base           | 0.965       | 0.913        | 0.876         | 0.905    | 0.853     | 0.916     | 0.853      |
| Waseda RoBERTa large (seq512) | 0.969       | 0.925        | 0.890         | 0.928    | 0.910     | 0.955     | 0.900      |
| LUKE Japanese base*           | 0.965       | 0.916        | 0.877         | 0.912    | -         | -         | 0.842      |
| LUKE Japanese large*          | 0.965       | 0.932        | 0.902         | 0.927    | -         | -         | 0.893      |
| DeBERTaV2 base                | 0.970       | 0.922        | 0.886         | 0.922    | 0.899     | 0.951     | 0.873      |
| DeBERTaV2 large               | 0.968       | 0.925        | 0.892         | 0.924    | 0.912     | 0.959     | 0.890      |

*The scores of LUKE are from [the official repository](https://github.com/studio-ousia/luke).
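
For readers who want to reproduce a row, the sketch below shows roughly how one such run (JNLI) could be set up with the transformers `Trainer`. The dataset files, field names, and hyperparameter values are placeholders, not the settings behind the table.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-large-japanese')
model = AutoModelForSequenceClassification.from_pretrained(
    'ku-nlp/deberta-v2-large-japanese', num_labels=3)  # JNLI: entailment/contradiction/neutral

# Hypothetical local JGLUE copy with Juman++-segmented "sentence1"/"sentence2" and an integer "label".
dataset = load_dataset('json', data_files={'train': 'jnli_train.json',
                                           'validation': 'jnli_dev.json'})

def preprocess(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'],
                     truncation=True, max_length=128)

dataset = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir='jnli-deberta-v2-large-japanese',
    learning_rate=2e-5,             # tuned per task in the card; illustrative here
    num_train_epochs=3,             # likewise illustrative
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=dataset['train'], eval_dataset=dataset['validation'])
trainer.train()
```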

## Acknowledgments

This work was supported by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) through General Collaboration Project no. jh221004, "Developing a Platform for Constructing and Sharing of Large-Scale Japanese Language Models".
For training models, we used mdx: a platform for the data-driven future.