---
datasets:
- botp/yentinglin-zh_TW_c4
language:
- zh
pipeline_tag: fill-mask
---

### Model Sources

- **Paper:** [BERT](https://arxiv.org/abs/1810.04805)

## Uses

#### Direct Use

This model can be used for masked language modeling: it predicts the tokens hidden behind `[MASK]` placeholders in Traditional Chinese text.
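
As a rough illustration of masked language modeling with this checkpoint, the sketch below fills every `[MASK]` in a sentence with its top-scoring candidate. The helper names `fill_all_masks` and `fill_with_model` are hypothetical and not part of this repository. The sketch assumes the `transformers` fill-mask pipeline, which returns a flat list of candidates for one mask and one list per mask for several, and it assumes the `Azion/bert-based-chinese` repository is reachable with your credentials.

```python
from typing import Any, Dict, List, Union

Candidate = Dict[str, Any]


def fill_all_masks(
    text: str,
    predictions: Union[List[Candidate], List[List[Candidate]]],
    mask_token: str = "[MASK]",
) -> str:
    """Replace each mask, left to right, with its highest-scoring candidate."""
    # One mask yields a flat list of candidates; several masks yield one list per mask.
    if predictions and isinstance(predictions[0], dict):
        predictions = [predictions]
    for candidates in predictions:
        best = max(candidates, key=lambda c: c["score"])
        # Only the first remaining mask is replaced on each pass.
        text = text.replace(mask_token, best["token_str"], 1)
    return text


def fill_with_model(text: str, model_id: str = "Azion/bert-based-chinese") -> str:
    """Run the fill-mask pipeline (downloads the model; needs network access)."""
    from transformers import pipeline  # imported lazily; hypothetical wrapper

    fill_mask = pipeline("fill-mask", model=model_id)
    return fill_all_masks(text, fill_mask(text))
```

Calling `fill_with_model("淡水老[MASK]有賣很多鐵蛋的攤販。")` would run the model end to end. Each mask is scored independently here, so the choice made for one mask does not influence the others.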

## Training

#### Training Procedure

* **type_vocab_size:** 2
* **vocab_size:** 21128
* **num_hidden_layers:** 12
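
One way to confirm that a downloaded checkpoint matches the settings listed above is to compare them against its configuration. This is a minimal sketch: `EXPECTED_SETTINGS`, `find_mismatches`, and `load_config_dict` are hypothetical names introduced for illustration, and only the three values stated in this card are checked.

```python
from typing import Any, Dict, Mapping, Tuple

# Values listed in this model card; every other setting is left unchecked.
EXPECTED_SETTINGS: Dict[str, int] = {
    "type_vocab_size": 2,
    "vocab_size": 21128,
    "num_hidden_layers": 12,
}


def find_mismatches(config: Mapping[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing setting to (expected, actual); missing keys report None."""
    return {
        key: (expected, config.get(key))
        for key, expected in EXPECTED_SETTINGS.items()
        if config.get(key) != expected
    }


def load_config_dict(model_id: str = "Azion/bert-based-chinese") -> Dict[str, Any]:
    """Fetch the checkpoint's configuration (needs network access and a valid login)."""
    from transformers import AutoConfig  # imported lazily

    return AutoConfig.from_pretrained(model_id).to_dict()
```

An empty result from `find_mismatches(load_config_dict())` would indicate that the checkpoint agrees with the three listed values.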

#### Training Data

The model was trained on the `botp/yentinglin-zh_TW_c4` dataset.


## Evaluation

| Dataset \ Pretrained BERT | bert-based-chinese | ckiplab | GufoLab |
| ------------- |:-------------:|:-------------:|:-------------:|
| 5000 Traditional Chinese Dataset | 0.7183 | 0.6989 | **0.8081** |
| 10000 Sol-Idea Dataset | 0.7874 | 0.7913 | **0.8025** |
| All Datasets | 0.7694 | 0.7678 | **0.8038** |

#### Results

| Test ID | [MASK] Input | Result Output |
| -------------|-------------|-------------|
| 1 | 今天禮拜[MASK]?我[MASK]是很想[MASK]班。 | 今天禮拜六?我不是很想上班。 |
| 2 | [MASK]灣並[MASK]是[MASK]國不可分割的一部分。 | 臺灣並不是中國不可分割的一部分。 |
| 3 | 如果可以是韋[MASK]安的最新歌[MASK]。 | 如果可以是韋禮安的最新歌曲。 |
| 4 | [MASK]水老[MASK]有賣很多鐵蛋的攤販。 | 淡水老街有賣很多鐵蛋的攤販。 |

**git-lfs Installation**

```
$ curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
$ sudo apt-get install git-lfs
$ git lfs install
$ pip install huggingface_hub
```

## How to Get Started With the Model

#### Log In to Hugging Face From the Terminal

```
$ huggingface-cli login
Token: <your own Hugging Face token>
```

#### Log In to Hugging Face From a Jupyter Notebook

```python
from huggingface_hub import notebook_login

notebook_login()
# When prompted for a token, paste your own Hugging Face token.
```
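
For scripts that cannot show an interactive prompt, a token can also be supplied programmatically. The sketch below is one possible approach, not the method used by this repository: it reads the token from the `HF_TOKEN` environment variable and passes it to `huggingface_hub.login`. The helper names `read_token` and `login_from_env` are hypothetical.

```python
import os
from typing import Mapping, Optional


def read_token(env: Mapping[str, str], var_name: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored under `var_name`, or None if it is unset or blank."""
    token = env.get(var_name, "").strip()
    return token or None


def login_from_env() -> bool:
    """Log in with the token from the environment; return False if none is set."""
    from huggingface_hub import login  # imported lazily

    token = read_token(os.environ)
    if token is None:
        return False
    login(token=token)
    return True
```

Keeping the token in an environment variable avoids writing it into notebooks or source files.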

#### Python Code

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# use_auth_token=True reuses the token saved by the login step;
# recent transformers releases prefer the equivalent `token=True` argument.
tokenizer = AutoTokenizer.from_pretrained("Azion/bert-based-chinese", use_auth_token=True)
model = AutoModelForMaskedLM.from_pretrained("Azion/bert-based-chinese", use_auth_token=True)
```
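
Once the tokenizer and model are loaded, predictions can be read directly from the model's logits. The following is a minimal sketch that assumes PyTorch is installed and that `tokenizer` and `model` are the objects created above. The function names `find_mask_positions` and `predict_masked_tokens` are hypothetical.

```python
from typing import List, Sequence


def find_mask_positions(input_ids: Sequence[int], mask_token_id: int) -> List[int]:
    """Return the index of every mask token in one tokenized sequence."""
    return [i for i, token_id in enumerate(input_ids) if token_id == mask_token_id]


def predict_masked_tokens(text: str, tokenizer, model) -> List[str]:
    """Return the highest-scoring token for each [MASK] in `text`, left to right."""
    import torch  # imported lazily so the helper above works without PyTorch

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

    input_ids = inputs["input_ids"][0].tolist()
    positions = find_mask_positions(input_ids, tokenizer.mask_token_id)
    # Pick the most likely vocabulary entry at each masked position.
    best_ids = [int(logits[0, pos].argmax()) for pos in positions]
    return tokenizer.convert_ids_to_tokens(best_ids)
```

For example, `predict_masked_tokens("淡水老[MASK]有賣很多鐵蛋的攤販。", tokenizer, model)` returns a one-element list with the model's top guess for the masked character.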