Go Inoue committed
Commit fd39d7b
1 Parent(s): 89437c2
Files changed (1)
  1. README.md +16 -20
README.md CHANGED
@@ -3,7 +3,7 @@ language:
- ar
license: apache-2.0
widget:
- - text: "الهدف من الحياة هو [MASK] ."
+ - text: "الهدف من الحياة هو [MASK] ."
---

# bert-base-camelbert-msa
@@ -18,7 +18,7 @@ We release eight models with different sizes and variants as follows:
|-|-|:-:|-:|-:|
||`bert-base-camelbert-mix`|CA,DA,MSA|167GB|17.3B|
||`bert-base-camelbert-ca`|CA|6GB|847M|
- |✔|`bert-base-camelbert-da`|DA|54GB|5.8B|
+ |✔|`bert-base-camelbert-da`|DA|54GB|5.8B|
||`bert-base-camelbert-msa`|MSA|107GB|12.6B|
||`bert-base-camelbert-msa-half`|MSA|53GB|6.3B|
||`bert-base-camelbert-msa-quarter`|MSA|27GB|3.1B|
@@ -37,27 +37,27 @@ You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-camelbert-da')
- >>> unmasker("الهدف من الحياة هو [MASK] .")
- [{'sequence': '[CLS] الهدف من الحياة هو.. [SEP]',
+ >>> unmasker("الهدف من الحياة هو [MASK] .")
+ [{'sequence': '[CLS] الهدف من الحياة هو.. [SEP]',
  'score': 0.062508225440979,
  'token': 18,
  'token_str': '.'},
- {'sequence': '[CLS] الهدف من الحياة هو الموت. [SEP]',
+ {'sequence': '[CLS] الهدف من الحياة هو الموت. [SEP]',
  'score': 0.033172328025102615,
  'token': 4295,
- 'token_str': 'الموت'},
- {'sequence': '[CLS] الهدف من الحياة هو الحياة. [SEP]',
+ 'token_str': 'الموت'},
+ {'sequence': '[CLS] الهدف من الحياة هو الحياة. [SEP]',
  'score': 0.029575437307357788,
  'token': 3696,
- 'token_str': 'الحياة'},
- {'sequence': '[CLS] الهدف من الحياة هو الرحيل. [SEP]',
+ 'token_str': 'الحياة'},
+ {'sequence': '[CLS] الهدف من الحياة هو الرحيل. [SEP]',
  'score': 0.02724040113389492,
  'token': 11449,
- 'token_str': 'الرحيل'},
- {'sequence': '[CLS] الهدف من الحياة هو الحب. [SEP]',
+ 'token_str': 'الرحيل'},
+ {'sequence': '[CLS] الهدف من الحياة هو الحب. [SEP]',
  'score': 0.01564178802073002,
  'token': 3088,
- 'token_str': 'الحب'}]
+ 'token_str': 'الحب'}]
```
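In recent versions of `transformers`, the fill-mask pipeline also accepts `top_k` and `targets` arguments, which restrict scoring to a fixed candidate set instead of the whole vocabulary; a minimal sketch reusing the example above (the two candidate words appear in the output above, so they are known vocabulary items):

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-camelbert-da')
>>> # Score only a fixed candidate set rather than the full vocabulary.
>>> unmasker("الهدف من الحياة هو [MASK] .", targets=["الحب", "الموت"], top_k=2)
```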

Here is how to use this model to get the features of a given text in PyTorch:
@@ -65,7 +65,7 @@ Here is how to use this model to get the features of a given text in PyTorch:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
- text = "مرحبا يا عالم."
+ text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
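The `output` above holds a hidden state for every token. If a single vector per sentence is needed, one common recipe (a general BERT technique, not something specific to this model) is to mean-pool the last hidden state over non-padding tokens; a minimal self-contained sketch:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
encoded_input = tokenizer("مرحبا يا عالم.", return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input)

# Average the token vectors, ignoring padding positions via the attention mask.
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for a base-sized BERT
```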
@@ -75,18 +75,14 @@ and in TensorFlow:
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
- text = "مرحبا يا عالم."
+ text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

## Training data
- - MSA
-   - [The Arabic Gigaword Fifth Edition](https://catalog.ldc.upenn.edu/LDC2011T11)
-   - [Abu El-Khair Corpus](http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus)
-   - [OSIAN corpus](https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian)
-   - [Arabic Wikipedia](https://archive.org/details/arwiki-20190201)
-   - The unshuffled version of the Arabic [OSCAR corpus](https://oscar-corpus.com/)
+ - DA
+   - A collection of dialectal Arabic data described in our paper.

## Training procedure
We use [the original implementation](https://github.com/google-research/bert) released by Google for pre-training.
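For orientation only: Google's codebase is driven by its `create_pretraining_data.py` and `run_pretraining.py` scripts. A rough modern equivalent of the same masked-language-modeling objective can be sketched with the `transformers` Trainer API; this is *not* the authors' setup, and `corpus.txt` plus all hyperparameters below are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
model = BertForMaskedLM(BertConfig())  # randomly initialized base-sized BERT

# 'corpus.txt' is a placeholder for the pre-training text, one document per line.
dataset = load_dataset('text', data_files='corpus.txt')['train']
dataset = dataset.map(
    lambda batch: tokenizer(batch['text'], truncation=True, max_length=128),
    batched=True, remove_columns=['text'])

# Randomly mask 15% of tokens, as in the original BERT objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='mlm-out', max_steps=1000),
    train_dataset=dataset,
    data_collator=collator)
trainer.train()
```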
 