Update README.md
You can also use this model to get the features of a given text.
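
A minimal sketch of feature extraction, assuming the model follows the standard `transformers` API; the checkpoint ID below is a placeholder, not the actual model name:

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint ID; substitute the model's actual Hub name.
model_name = "owner/char-level-model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Character-level vocabulary: the sequence length is roughly one
# token per input character, plus any special tokens.
inputs = tokenizer("text", return_tensors="pt")
outputs = model(**inputs)
features = outputs.last_hidden_state  # shape: (1, seq_len, hidden_size)
```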

A character-level vocabulary of 6K tokens is used. To be precise, rare characters may be split into bytes, because byte-level byte-pair encoding (BPE) is used. The BPE tokenizer was trained on a small subset of the training data. Since the data were converted into a one-character-per-line format, merge operations never cross character boundaries.

Note that the tokenizer maps the space character (U+0020) to `[UNK]`, because preprocessing removed whitespace from the training data.
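
This behavior can be illustrated with a short, hypothetical snippet (again with a placeholder checkpoint ID); the printed tokens are what the description above implies, not verified output:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint ID; substitute the model's actual Hub name.
tokenizer = AutoTokenizer.from_pretrained("owner/char-level-model")

# Merges never cross character boundaries, so ordinary characters
# should each come back as a single token.
print(tokenizer.tokenize("abc"))  # expected: ['a', 'b', 'c']

# U+0020 was stripped from the training data, so a space should
# fall back to the unknown token.
print(tokenizer.tokenize("a b"))  # expected: ['a', '[UNK]', 'b']
```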
## Training data
We used the following corpora for pre-training: