Update README.md
README.md
CHANGED
@@ -22,7 +22,7 @@ The codes for the pretraining are available at [retarfi/language-pretraining](ht
 
 ## Model architecture
 
-
+The model architecture is the same as ELECTRA small in the [original ELECTRA implementation](https://github.com/google-research/electra); 12 layers, 256 dimensions of hidden states, and 4 attention heads.
 
 ## Training Data
 

@@ -40,7 +40,7 @@ The vocabulary size is 32768.
 
 ## Training
 
-The models are trained with the same configuration as ELECTRA small in the [original ELECTRA paper](https://arxiv.org/abs/2003.10555); 128 tokens per instance, 128 instances per batch, and 1M training steps.
+The models are trained with the same configuration as ELECTRA small in the [original ELECTRA paper](https://arxiv.org/abs/2003.10555) except size; 128 tokens per instance, 128 instances per batch, and 1M training steps.
 
 The size of the generator is the same as that of the discriminator.
 
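As a rough illustration of the architecture the added lines describe (12 layers, 256-dimensional hidden states, 4 attention heads, 32768-token vocabulary), a minimal sketch using Hugging Face `transformers` is shown below. The `embedding_size`, `intermediate_size`, and `max_position_embeddings` values are the ELECTRA-small defaults from the original implementation, assumed here rather than taken from this diff.

```python
from transformers import ElectraConfig, ElectraForPreTraining

# Discriminator configuration matching the README's description:
# 12 layers, 256-dim hidden states, 4 attention heads, 32768-token vocabulary.
# embedding_size / intermediate_size / max_position_embeddings are ELECTRA-small
# defaults from the original implementation (assumptions, not stated in this diff).
config = ElectraConfig(
    vocab_size=32768,
    num_hidden_layers=12,
    hidden_size=256,
    num_attention_heads=4,
    embedding_size=128,
    intermediate_size=1024,
    max_position_embeddings=512,
)

discriminator = ElectraForPreTraining(config)
print(sum(p.numel() for p in discriminator.parameters()))  # total parameter count
```

Per the diff's note that the generator has the same size as the discriminator, the generator would be instantiated from the same configuration.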