Update README.md
README.md CHANGED

@@ -40,7 +40,7 @@ You can also use this model to get the features of a given text.
 
 ## Vocabulary
 
-
+A character-level vocabulary of size 6K is used. To be precise, rare characters may be split into bytes because byte-level byte-pair encoding (BPE) is used. The BPE tokenizer was trained on a small subset of the training data. Since the data were converted into a one-character-per-line format, merge operations never transgressed character boundaries.
 
 ## Training data
 
@@ -55,7 +55,7 @@ Also note that Japanese Wikipedia was duplicated 10 times to make the total size
 
 ## Training procedure
 
-The training took
+The training took about 3 months (with two interruptions) with a single NVIDIA A100 80GB GPU.
 
 The following hyperparameters were used during pre-training:
 
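For readers curious about the vocabulary construction described in the added paragraph, here is a minimal sketch of the idea: rewrite a corpus subset one character per line and train a byte-level BPE tokenizer on it, so merges can never cross character boundaries. The use of the Hugging Face `tokenizers` library, the file names, and the exact options are assumptions for illustration; the README does not state which tooling was actually used.

```python
# Hypothetical sketch: build a ~6K byte-level BPE vocabulary on data rewritten
# one character per line, so merges only happen within a single character.
# Library choice and file names are assumptions, not the README's actual setup.
from tokenizers import ByteLevelBPETokenizer

# Rewrite a small subset of the training data into one-character-per-line format.
with open("subset.txt", encoding="utf-8") as src, \
        open("one_char_per_line.txt", "w", encoding="utf-8") as dst:
    for line in src:
        for ch in line.rstrip("\n"):
            dst.write(ch + "\n")

# Train byte-level BPE: because each training "sentence" is a single character,
# merge operations stay inside that character's bytes, and rare characters that
# never earn a merge simply remain split into raw byte tokens.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["one_char_per_line.txt"], vocab_size=6000)
tokenizer.save_model(".")
```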