uer committed on
Commit
d611421
1 Parent(s): a74565b

Update README.md

Files changed (1)
  1. README.md +96 -4
README.md CHANGED
@@ -12,11 +12,23 @@ widget:
12
 
13
  ## Model description
14
 
15
- The model is used to generate Chinese text. You can download the model either from the [GPT2-Chinese GitHub page](https://github.com/Morizeyao/GPT2-Chinese), or via HuggingFace from the link [gpt2-distil-chinese-cluecorpussmall](https://huggingface.co/uer/gpt2-distil-chinese-cluecorpussmall). The model is called GPT2-distil because its configuration follows [distilgpt2](https://huggingface.co/distilgpt2), which has 6 layers, a hidden size of 768, and 12 attention heads. The pre-training does not involve the supervision of larger models.
16
 
17
  ## How to use
18
 
19
- You can use the model directly with a pipeline for text generation:
20
 
21
  ```python
22
  >>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
@@ -33,7 +45,9 @@ You can use the model directly with a pipeline for text generation:
33
 
34
  ## Training procedure
35
 
36
- The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for 250,000 additional steps with a sequence length of 1024.
37
 
38
  Stage1:
39
 
@@ -82,6 +96,71 @@ python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluec
82
  --layers_num 6
83
  ```
84
 
85
  ### BibTeX entry and citation info
86
 
87
  ```
@@ -98,4 +177,17 @@ python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluec
98
  pages={241},
99
  year={2019}
100
  }
101
- ```
12
 
13
  ## Model description
14
 
15
+ The set of GPT2 models, except for the GPT2-xlarge model, is pre-trained by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in [this paper](https://arxiv.org/abs/1909.05658). The GPT2-xlarge model is pre-trained by [TencentPretrain](https://github.com/Tencent/TencentPretrain), introduced in [this paper](https://arxiv.org/abs/2212.06385), which inherits UER-py to support models with more than one billion parameters and extends it to a multimodal pre-training framework. The other models can also be pre-trained by TencentPretrain.
16
+
17
+ These models are used to generate Chinese text. You can download the set of Chinese GPT2 models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo), or via HuggingFace from the links below:
18
+
19
+ | Model | Link (L = layers, H = hidden size) |
20
+ | ----------------- | :----------------------------: |
21
+ | **GPT2-distil** | [**L=6/H=768**][distil] |
22
+ | **GPT2** | [**L=12/H=768**][base] |
23
+ | **GPT2-medium** | [**L=24/H=1024**][medium] |
24
+ | **GPT2-large** | [**L=36/H=1280**][large] |
25
+ | **GPT2-xlarge** | [**L=48/H=1600**][xlarge] |
26
+
27
+ Note that the 6-layer model is called GPT2-distil because it follows the configuration of [distilgpt2](https://huggingface.co/distilgpt2); its pre-training does not involve the supervision of larger models.
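As a rough check on the sizes in the table above, the parameter count of a GPT-2-style model can be estimated from L and H alone. The sketch below is illustrative and is not part of UER-py or TencentPretrain: the function name is hypothetical, and the vocabulary size of 21,128 entries (assumed for `google_zh_vocab.txt`) and the 1,024 positions are assumptions rather than values stated in this README.

```python
def gpt2_param_count(layers, hidden, vocab_size=21128, max_positions=1024):
    """Estimate the parameters of a GPT-2-style decoder with tied embeddings.

    vocab_size and max_positions are assumed defaults, not values from the README.
    """
    # Per block: attention (4*H^2 + 4*H), MLP (8*H^2 + 5*H), two LayerNorms (4*H).
    per_block = 12 * hidden * hidden + 13 * hidden
    # Token and position embeddings; the output layer reuses the token embedding.
    embeddings = (vocab_size + max_positions) * hidden
    final_layer_norm = 2 * hidden
    return layers * per_block + embeddings + final_layer_norm


# (layers, hidden size) pairs copied from the table above.
CONFIGS = {
    "GPT2-distil": (6, 768),
    "GPT2": (12, 768),
    "GPT2-medium": (24, 1024),
    "GPT2-large": (36, 1280),
    "GPT2-xlarge": (48, 1600),
}

if __name__ == "__main__":
    for name, (layers, hidden) in CONFIGS.items():
        millions = gpt2_param_count(layers, hidden) / 1e6
        print(f"{name}: about {millions:.0f}M parameters")
```

Under these assumptions, only the L=48/H=1600 configuration exceeds one billion parameters, which is consistent with GPT2-xlarge being the model pre-trained with TencentPretrain.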
28
 
29
  ## How to use
30
 
31
+ You can use the model directly with a pipeline for text generation (taking GPT2-distil as an example):
32
 
33
  ```python
34
  >>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
 
45
 
46
  ## Training procedure
47
 
48
+ The GPT2-xlarge model is pre-trained by [TencentPretrain](https://github.com/Tencent/TencentPretrain), and the others are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for 250,000 additional steps with a sequence length of 1024.
49
+
50
+ For the models pre-trained by UER-py, we take GPT2-distil as an example:
51
 
52
  Stage1:
53
 
 
96
  --layers_num 6
97
  ```
98
 
99
+ For the GPT2-xlarge model, we use TencentPretrain.
100
+
101
+ Stage1:
102
+
103
+ ```
104
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
105
+ --vocab_path models/google_zh_vocab.txt \
106
+ --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
107
+ --seq_length 128 --processes_num 32 --data_processor lm
108
+ ```
109
+
110
+ ```
111
+ deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
112
+ --dataset_path corpora/cluecorpussmall_lm_seq128_dataset.pt \
113
+ --vocab_path models/google_zh_vocab.txt \
114
+ --config_path models/gpt2/xlarge_config.json \
115
+ --output_model_path models/cluecorpussmall_gpt2_xlarge_seq128 \
116
+ --world_size 8 --batch_size 64 \
117
+ --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
118
+ --deepspeed_checkpoint_activations --deepspeed_checkpoint_layers_num 24
119
+ ```
120
+
121
+ Before Stage2, we extract the consolidated fp32 weights from the DeepSpeed checkpoint (the `zero_to_fp32.py` script handles both ZeRO stage 2 and stage 3 checkpoints):
122
+
123
+ ```
124
+ python3 models/cluecorpussmall_gpt2_xlarge_seq128/zero_to_fp32.py \
125
+ models/cluecorpussmall_gpt2_xlarge_seq128/ models/cluecorpussmall_gpt2_xlarge_seq128.bin
126
+ ```
127
+
128
+ Stage2:
129
+
130
+ ```
131
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
132
+ --vocab_path models/google_zh_vocab.txt \
133
+ --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
134
+ --seq_length 1024 --processes_num 32 --data_processor lm
135
+ ```
136
+
137
+ ```
138
+ deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
139
+ --dataset_path corpora/cluecorpussmall_lm_seq1024_dataset.pt \
140
+ --vocab_path models/google_zh_vocab.txt \
141
+ --config_path models/gpt2/xlarge_config.json \
142
+ --pretrained_model_path models/cluecorpussmall_gpt2_xlarge_seq128.bin \
143
+ --output_model_path models/cluecorpussmall_gpt2_xlarge_seq1024_stage2 \
144
+ --world_size 8 --batch_size 16 --learning_rate 5e-5 \
145
+ --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
146
+ --deepspeed_checkpoint_activations --deepspeed_checkpoint_layers_num 6
147
+ ```
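To compare the two GPT2-xlarge stages, the number of token positions each one processes can be worked out from the flags in the commands above. This is a back-of-the-envelope sketch, not output from TencentPretrain: the helper name is hypothetical, and it assumes that `--batch_size` is counted per process, so the effective batch is `batch_size * world_size`. Padding is ignored, so the figures are upper bounds.

```python
def token_positions(total_steps, batch_size, seq_length, world_size=1):
    """Upper bound on token positions processed during one training stage.

    Assumes batch_size is per process; drop world_size if it is already global.
    """
    return total_steps * batch_size * world_size * seq_length


# Values copied from the Stage1 and Stage2 pretrain.py commands for GPT2-xlarge.
STAGES = {
    "Stage1": dict(total_steps=1_000_000, batch_size=64, seq_length=128, world_size=8),
    "Stage2": dict(total_steps=250_000, batch_size=16, seq_length=1024, world_size=8),
}

if __name__ == "__main__":
    for name, flags in STAGES.items():
        billions = token_positions(**flags) / 1e9
        print(f"{name}: up to {billions:.1f}B token positions")
```

With these flags, Stage2 processes about half as many token positions as Stage1, although it runs only a quarter as many steps.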
148
+
149
+ Then, we extract the consolidated fp32 weights from the DeepSpeed checkpoint (the `zero_to_fp32.py` script handles both ZeRO stage 2 and stage 3 checkpoints):
150
+
151
+ ```
152
+ python3 models/cluecorpussmall_gpt2_xlarge_seq1024_stage2/zero_to_fp32.py \
153
+ models/cluecorpussmall_gpt2_xlarge_seq1024_stage2/ models/cluecorpussmall_gpt2_xlarge_seq1024_stage2.bin
154
+ ```
155
+
156
+ Finally, we convert the pre-trained model into Huggingface's format:
157
+
158
+ ```
159
+ python3 scripts/convert_gpt2_from_tencentpretrain_to_huggingface.py --input_model_path models/cluecorpussmall_gpt2_xlarge_seq1024_stage2.bin \
160
+ --output_model_path pytorch_model.bin \
161
+ --layers_num 48
162
+ ```
163
+
164
  ### BibTeX entry and citation info
165
 
166
  ```
 
177
  pages={241},
178
  year={2019}
179
  }
180
+
181
+ @article{zhao2023tencentpretrain,
182
+ title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
183
+ author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
184
+ journal={ACL 2023},
185
+ pages={217},
186
+   year={2023}
+ }
187
+ ```
188
+
189
+ [distil]:https://huggingface.co/uer/gpt2-distil-chinese-cluecorpussmall
190
+ [base]:https://huggingface.co/uer/gpt2-chinese-cluecorpussmall
191
+ [medium]:https://huggingface.co/uer/gpt2-medium-chinese-cluecorpussmall
192
+ [large]:https://huggingface.co/uer/gpt2-large-chinese-cluecorpussmall
193
+ [xlarge]:https://huggingface.co/uer/gpt2-xlarge-chinese-cluecorpussmall
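The reference links above follow one naming pattern, so the repository ID for a given size can be derived instead of hard-coded. The helper below is a hypothetical convenience, not something provided by UER-py or the Hub; it only encodes the five links listed in this README.

```python
HUB_PREFIX = "https://huggingface.co/"
SIZES = ("distil", "base", "medium", "large", "xlarge")


def repo_id(size):
    """Return the repository ID for one of the five model sizes listed above."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    # The base model has no size infix; every other size is inserted after "gpt2".
    infix = "" if size == "base" else f"-{size}"
    return f"uer/gpt2{infix}-chinese-cluecorpussmall"


if __name__ == "__main__":
    for size in SIZES:
        print(size, HUB_PREFIX + repo_id(size))
```

The returned ID is the string one would pass to `from_pretrained` in the usage snippet, assuming that snippet loads the model by repository ID.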