---
language:
- ja
- zh
- ko
license: cc-by-sa-4.0
datasets:
- wikipedia
mask_token: "[MASK]"
widget:
- text: "早稲田大学で自然言語処理を[MASK]ぶ。"
- text: "李白是[MASK]朝人。"
- text: "불고기[MASK] 먹겠습니다."
---

### Model description

- This model was trained on the Chinese (**ZH**), Japanese (**JA**), and Korean (**KO**) editions of Wikipedia for 5 epochs.

### How to use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
```

- You do not need to segment the input text before fine-tuning on downstream tasks (a quick fill-mask check appears at the end of this card).
- (Though you may obtain better results if you apply morphological analysis to the data before fine-tuning; a sketch of this step also appears at the end of this card.)

### Morphological analysis tools

- ZH: [LTP](https://github.com/HIT-SCIR/ltp)
- JA: [Juman++](https://github.com/ku-nlp/jumanpp)
- KO: [KoNLPy](https://github.com/konlpy/konlpy) (Kkma class)

### Tokenization

- We use character-based tokenization with a **whole-word-masking** strategy (illustrated at the end of this card).

### Model size

- vocab_size: 15015
- num_hidden_layers: 4
- hidden_size: 512
- num_attention_heads: 8
- param_num: 25M
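
### Examples

As a quick check of the checkpoint loaded under "How to use", the standard `transformers` fill-mask pipeline can reproduce the widget examples above; nothing here is specific to this model.

```python
from transformers import pipeline

# Reproduce the first widget example with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="conan1024hao/cjkbert-small")

# Prints the top candidates for the masked character.
print(fill_mask("早稲田大学で自然言語処理を[MASK]ぶ。"))
```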
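
A minimal sketch of the optional pre-segmentation step, using the Korean tool (KoNLPy's Kkma class) as an example. Joining morphemes with spaces is an assumption about the desired data format; the card does not specify how segmented text should be formatted.

```python
from konlpy.tag import Kkma  # requires the konlpy package and a Java runtime

kkma = Kkma()

# Split a sentence into morphemes and rejoin them with spaces so that
# word boundaries are visible in the fine-tuning data.
morphemes = kkma.morphs("불고기를 먹겠습니다.")
print(" ".join(morphemes))
```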
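
To illustrate the whole-word-masking strategy over character tokens: when a word is chosen for masking, every character token inside it is masked together. The word boundaries below are hard-coded for illustration; during pretraining they would come from the morphological analyzers listed above. This is a toy sketch of the idea, not the actual training code.

```python
import random

# Word boundaries, normally produced by a morphological analyzer.
words = ["早稲田", "大学", "で", "自然", "言語", "処理", "を", "学ぶ", "。"]

# Pick one word and replace each of its character tokens with [MASK].
idx = random.randrange(len(words))
masked = [("[MASK]" * len(w)) if i == idx else w for i, w in enumerate(words)]
print("".join(masked))
```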
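
The hyperparameters under "Model size" are enough to instantiate an untrained model of the same shape, assuming the checkpoint follows the standard BERT architecture (suggested by the model name but not stated on the card). `intermediate_size` is an additional assumption (4 × hidden_size); the pretrained config on the Hub is authoritative.

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=15015,
    num_hidden_layers=4,
    hidden_size=512,
    num_attention_heads=8,
    intermediate_size=2048,  # assumption: 4 * hidden_size, not stated on the card
)
model = BertForMaskedLM(config)  # randomly initialized, same shape as above

# Compare the printed count with the 25M reported above; a large gap
# would mean the assumed intermediate_size is off.
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```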