Update README.md
You can also use this model to get the features of a given text.
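
A minimal sketch of feature extraction, assuming the model follows the standard `transformers` API; the checkpoint ID below is a placeholder, not the actual model name:

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint ID; substitute the model's actual Hub name.
model_name = "owner/char-level-model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Character-level vocabulary: the sequence length is roughly one
# token per input character, plus any special tokens.
inputs = tokenizer("text", return_tensors="pt")
outputs = model(**inputs)
features = outputs.last_hidden_state  # shape: (1, seq_len, hidden_size)
```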

A character-level vocabulary of 6K tokens is used. To be precise, rare characters may be split into bytes, because byte-level byte-pair encoding (BPE) is used. The BPE tokenizer was trained on a small subset of the training data. Since the data were converted into a one-character-per-line format, merge operations never cross character boundaries.

Note that the tokenizer maps the space character (U+0020) to `[UNK]`, because preprocessing removed whitespace from the training data.
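
This behavior can be illustrated with a short, hypothetical snippet (again with a placeholder checkpoint ID); the printed tokens are what the description above implies, not verified output:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint ID; substitute the model's actual Hub name.
tokenizer = AutoTokenizer.from_pretrained("owner/char-level-model")

# Merges never cross character boundaries, so ordinary characters
# should each come back as a single token.
print(tokenizer.tokenize("abc"))  # expected: ['a', 'b', 'c']

# U+0020 was stripped from the training data, so a space should
# fall back to the unknown token.
print(tokenizer.tokenize("a b"))  # expected: ['a', '[UNK]', 'b']
```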
## Training data
We used the following corpora for pre-training: