ToastyPigeon commited on
Commit
82e2b01
·
verified ·
1 Parent(s): af115cd

Training in progress, step 375, checkpoint

Browse files
checkpoint-375/README.md ADDED
@@ -0,0 +1,202 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ base_model: internlm/internlm3-8b-instruct
3
+ library_name: peft
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
200
+ ### Framework versions
201
+
202
+ - PEFT 0.14.0
checkpoint-375/adapter_config.json ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "internlm/internlm3-8b-instruct",
5
+ "bias": "none",
6
+ "eva_config": null,
7
+ "exclude_modules": null,
8
+ "fan_in_fan_out": null,
9
+ "inference_mode": true,
10
+ "init_lora_weights": true,
11
+ "layer_replication": null,
12
+ "layers_pattern": null,
13
+ "layers_to_transform": null,
14
+ "loftq_config": {},
15
+ "lora_alpha": 64,
16
+ "lora_bias": false,
17
+ "lora_dropout": 0.25,
18
+ "megatron_config": null,
19
+ "megatron_core": "megatron.core",
20
+ "modules_to_save": null,
21
+ "peft_type": "LORA",
22
+ "r": 32,
23
+ "rank_pattern": {},
24
+ "revision": null,
25
+ "target_modules": [
26
+ "v_proj",
27
+ "gate_proj",
28
+ "down_proj",
29
+ "k_proj",
30
+ "o_proj",
31
+ "q_proj",
32
+ "up_proj"
33
+ ],
34
+ "task_type": "CAUSAL_LM",
35
+ "use_dora": false,
36
+ "use_rslora": false
37
+ }
checkpoint-375/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d8d71366d9cc2c09a22a5d09e4263c8cfef955840bdc71971d5e6de4b4020cb5
3
+ size 2308615184
checkpoint-375/global_step375/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6777fe9d01b773c5c60c8dc3e3fdfe2c58ffb7a9dee929d89c9ae3b374df12ba
3
+ size 187091776
checkpoint-375/global_step375/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3650c45007155caeb23626f5d6df1ff136ed1c1b430af95edd58c9208cb3fa81
3
+ size 187091776
checkpoint-375/global_step375/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bcb2f8cccd870d4680b92794df34756848bae4a73068b88a5012385e2643be41
3
+ size 187091776
checkpoint-375/global_step375/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9bbc96108858eaf05b95a352784a9937a3080058f587c4ef55def7179061ca64
3
+ size 187091776
checkpoint-375/global_step375/zero_pp_rank_0_mp_rank_00_model_states.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:24bbd85b65d42346d14b87b9605b56bf4d0a64a6cfdee8b8a89e0dacd6d58a86
3
+ size 124777254
checkpoint-375/global_step375/zero_pp_rank_1_mp_rank_00_model_states.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:abf6cf8650efb42b9c9205db6d0db0cdad598ef42df5d881699f6afc58d11004
3
+ size 124777254
checkpoint-375/global_step375/zero_pp_rank_2_mp_rank_00_model_states.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8b18c1501f4876f1c514b39a2c637247743cf5d0104ca24c1ca840b7e3fa49da
3
+ size 124777254
checkpoint-375/global_step375/zero_pp_rank_3_mp_rank_00_model_states.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:af1666d267f29623ce4468b4a7a58e795135ae9d42ec798a3079fd4a6b939c97
3
+ size 124777254
checkpoint-375/latest ADDED
@@ -0,0 +1 @@
 
 
1
+ global_step375
checkpoint-375/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:db27ed1d9351ecef9f6fa8e0a0344db373dcd44e5cf66868a18b1d3d051d42b8
3
+ size 14960
checkpoint-375/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:326e222cf3fe2d14248046ad69a2c46c167e8e958b2e42179c6d85edb333ee49
3
+ size 14960
checkpoint-375/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ace3ace7aec43fd1c6df07e0ad33bf89240b4b73fb9ea5d400f0b3191b943177
3
+ size 14960
checkpoint-375/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0564fb18fa63fc0cc5d8a09463435503edb50ae52cc139d2ee5292cbba60a6bc
3
+ size 14960
checkpoint-375/scheduler.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e8365a1287812bae366360ab7bc14486613ac684c2699e0c42d74f156faff0bd
3
+ size 1064
checkpoint-375/special_tokens_map.json ADDED
@@ -0,0 +1,54 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|action_start|>",
6
+ "<|action_end|>",
7
+ "<|interpreter|>",
8
+ "<|plugin|>",
9
+ "<restate>",
10
+ "</restate>",
11
+ "<planning>",
12
+ "</planning>",
13
+ "<recollect>",
14
+ "</recollect>",
15
+ "<execution>",
16
+ "</execution>",
17
+ "<review>",
18
+ "</review>",
19
+ "<summarize>",
20
+ "</summarize>",
21
+ "<retry>",
22
+ "</retry>",
23
+ "<conclude>",
24
+ "</conclude>"
25
+ ],
26
+ "bos_token": {
27
+ "content": "<s>",
28
+ "lstrip": false,
29
+ "normalized": false,
30
+ "rstrip": false,
31
+ "single_word": false
32
+ },
33
+ "eos_token": {
34
+ "content": "</s>",
35
+ "lstrip": false,
36
+ "normalized": false,
37
+ "rstrip": false,
38
+ "single_word": false
39
+ },
40
+ "pad_token": {
41
+ "content": "</s>",
42
+ "lstrip": false,
43
+ "normalized": false,
44
+ "rstrip": false,
45
+ "single_word": false
46
+ },
47
+ "unk_token": {
48
+ "content": "<unk>",
49
+ "lstrip": false,
50
+ "normalized": false,
51
+ "rstrip": false,
52
+ "single_word": false
53
+ }
54
+ }
checkpoint-375/tokenizer.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bcacff3229854f5103ee7a85473a30ca9a8b3a68f3aae9b7479574b23ac2256b
3
+ size 2475075
checkpoint-375/tokenizer_config.json ADDED
@@ -0,0 +1,249 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": true,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<unk>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<s>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "</s>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "128111": {
31
+ "content": "<restate>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "128112": {
39
+ "content": "</restate>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ },
46
+ "128113": {
47
+ "content": "<planning>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": true
53
+ },
54
+ "128114": {
55
+ "content": "</planning>",
56
+ "lstrip": false,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": true
61
+ },
62
+ "128115": {
63
+ "content": "<recollect>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "128116": {
71
+ "content": "</recollect>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": true
77
+ },
78
+ "128117": {
79
+ "content": "<execution>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "128118": {
87
+ "content": "</execution>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": true
93
+ },
94
+ "128119": {
95
+ "content": "<review>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": true
101
+ },
102
+ "128120": {
103
+ "content": "</review>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": true
109
+ },
110
+ "128121": {
111
+ "content": "<summarize>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": true
117
+ },
118
+ "128122": {
119
+ "content": "</summarize>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": true
125
+ },
126
+ "128123": {
127
+ "content": "<retry>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": true
133
+ },
134
+ "128124": {
135
+ "content": "</retry>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": true
141
+ },
142
+ "128125": {
143
+ "content": "<conclude>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": true
149
+ },
150
+ "128126": {
151
+ "content": "</conclude>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": true
157
+ },
158
+ "128127": {
159
+ "content": "<|plugin|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": true
165
+ },
166
+ "128128": {
167
+ "content": "<|interpreter|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": true
173
+ },
174
+ "128129": {
175
+ "content": "<|action_end|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": true
181
+ },
182
+ "128130": {
183
+ "content": "<|action_start|>",
184
+ "lstrip": false,
185
+ "normalized": false,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": true
189
+ },
190
+ "128131": {
191
+ "content": "<|im_end|>",
192
+ "lstrip": false,
193
+ "normalized": false,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": true
197
+ },
198
+ "128132": {
199
+ "content": "<|im_start|>",
200
+ "lstrip": false,
201
+ "normalized": false,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": true
205
+ }
206
+ },
207
+ "additional_special_tokens": [
208
+ "<|im_start|>",
209
+ "<|im_end|>",
210
+ "<|action_start|>",
211
+ "<|action_end|>",
212
+ "<|interpreter|>",
213
+ "<|plugin|>",
214
+ "<restate>",
215
+ "</restate>",
216
+ "<planning>",
217
+ "</planning>",
218
+ "<recollect>",
219
+ "</recollect>",
220
+ "<execution>",
221
+ "</execution>",
222
+ "<review>",
223
+ "</review>",
224
+ "<summarize>",
225
+ "</summarize>",
226
+ "<retry>",
227
+ "</retry>",
228
+ "<conclude>",
229
+ "</conclude>"
230
+ ],
231
+ "auto_map": {
232
+ "AutoTokenizer": [
233
+ "internlm/internlm3-8b-instruct--tokenization_internlm3.InternLM3Tokenizer",
234
+ null
235
+ ]
236
+ },
237
+ "bos_token": "<s>",
238
+ "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
239
+ "clean_up_tokenization_spaces": false,
240
+ "eos_token": "</s>",
241
+ "extra_special_tokens": {},
242
+ "model_max_length": 1000000000000000019884624838656,
243
+ "pad_token": "</s>",
244
+ "sp_model_kwargs": {},
245
+ "spaces_between_special_tokens": false,
246
+ "tokenizer_class": "InternLM3Tokenizer",
247
+ "unk_token": "<unk>",
248
+ "use_default_system_prompt": false
249
+ }
checkpoint-375/trainer_state.json ADDED
@@ -0,0 +1,2706 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 0.5,
5
+ "eval_steps": 75,
6
+ "global_step": 375,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.0013333333333333333,
13
+ "grad_norm": 5.494853571957073,
14
+ "learning_rate": 1.5e-06,
15
+ "loss": 2.2794,
16
+ "step": 1
17
+ },
18
+ {
19
+ "epoch": 0.0013333333333333333,
20
+ "eval_loss": 1.8317261934280396,
21
+ "eval_runtime": 98.252,
22
+ "eval_samples_per_second": 1.018,
23
+ "eval_steps_per_second": 0.254,
24
+ "step": 1
25
+ },
26
+ {
27
+ "epoch": 0.0026666666666666666,
28
+ "grad_norm": 0.4396391591358924,
29
+ "learning_rate": 3e-06,
30
+ "loss": 2.2089,
31
+ "step": 2
32
+ },
33
+ {
34
+ "epoch": 0.004,
35
+ "grad_norm": 0.6704432458412378,
36
+ "learning_rate": 4.5e-06,
37
+ "loss": 2.3789,
38
+ "step": 3
39
+ },
40
+ {
41
+ "epoch": 0.005333333333333333,
42
+ "grad_norm": 0.5524719828039701,
43
+ "learning_rate": 6e-06,
44
+ "loss": 1.9508,
45
+ "step": 4
46
+ },
47
+ {
48
+ "epoch": 0.006666666666666667,
49
+ "grad_norm": 1.551345196476793,
50
+ "learning_rate": 7.5e-06,
51
+ "loss": 2.3302,
52
+ "step": 5
53
+ },
54
+ {
55
+ "epoch": 0.008,
56
+ "grad_norm": 1.4425623787692365,
57
+ "learning_rate": 9e-06,
58
+ "loss": 2.4649,
59
+ "step": 6
60
+ },
61
+ {
62
+ "epoch": 0.009333333333333334,
63
+ "grad_norm": 1.9558403413673842,
64
+ "learning_rate": 1.05e-05,
65
+ "loss": 2.5318,
66
+ "step": 7
67
+ },
68
+ {
69
+ "epoch": 0.010666666666666666,
70
+ "grad_norm": 0.6930931382639645,
71
+ "learning_rate": 1.2e-05,
72
+ "loss": 1.9398,
73
+ "step": 8
74
+ },
75
+ {
76
+ "epoch": 0.012,
77
+ "grad_norm": 0.40522998481846323,
78
+ "learning_rate": 1.3500000000000001e-05,
79
+ "loss": 2.4355,
80
+ "step": 9
81
+ },
82
+ {
83
+ "epoch": 0.013333333333333334,
84
+ "grad_norm": 2.194079323679347,
85
+ "learning_rate": 1.5e-05,
86
+ "loss": 2.3417,
87
+ "step": 10
88
+ },
89
+ {
90
+ "epoch": 0.014666666666666666,
91
+ "grad_norm": 2.5436460717739156,
92
+ "learning_rate": 1.65e-05,
93
+ "loss": 2.4571,
94
+ "step": 11
95
+ },
96
+ {
97
+ "epoch": 0.016,
98
+ "grad_norm": 1.9496092240793341,
99
+ "learning_rate": 1.8e-05,
100
+ "loss": 1.9637,
101
+ "step": 12
102
+ },
103
+ {
104
+ "epoch": 0.017333333333333333,
105
+ "grad_norm": 1.2616722727237577,
106
+ "learning_rate": 1.95e-05,
107
+ "loss": 2.5866,
108
+ "step": 13
109
+ },
110
+ {
111
+ "epoch": 0.018666666666666668,
112
+ "grad_norm": 0.3060862022495338,
113
+ "learning_rate": 2.1e-05,
114
+ "loss": 2.1547,
115
+ "step": 14
116
+ },
117
+ {
118
+ "epoch": 0.02,
119
+ "grad_norm": 0.5028816367161089,
120
+ "learning_rate": 2.25e-05,
121
+ "loss": 1.8957,
122
+ "step": 15
123
+ },
124
+ {
125
+ "epoch": 0.021333333333333333,
126
+ "grad_norm": 0.5034069784636721,
127
+ "learning_rate": 2.4e-05,
128
+ "loss": 2.1151,
129
+ "step": 16
130
+ },
131
+ {
132
+ "epoch": 0.02266666666666667,
133
+ "grad_norm": 1.5564960276262187,
134
+ "learning_rate": 2.55e-05,
135
+ "loss": 2.3041,
136
+ "step": 17
137
+ },
138
+ {
139
+ "epoch": 0.024,
140
+ "grad_norm": 0.4958603020666427,
141
+ "learning_rate": 2.7000000000000002e-05,
142
+ "loss": 2.0962,
143
+ "step": 18
144
+ },
145
+ {
146
+ "epoch": 0.025333333333333333,
147
+ "grad_norm": 1.5362995691059922,
148
+ "learning_rate": 2.8499999999999998e-05,
149
+ "loss": 2.3989,
150
+ "step": 19
151
+ },
152
+ {
153
+ "epoch": 0.02666666666666667,
154
+ "grad_norm": 0.648652241685575,
155
+ "learning_rate": 3e-05,
156
+ "loss": 2.5339,
157
+ "step": 20
158
+ },
159
+ {
160
+ "epoch": 0.028,
161
+ "grad_norm": 0.554470724602306,
162
+ "learning_rate": 2.999996958555301e-05,
163
+ "loss": 2.4793,
164
+ "step": 21
165
+ },
166
+ {
167
+ "epoch": 0.029333333333333333,
168
+ "grad_norm": 1.3508987707917082,
169
+ "learning_rate": 2.999987834234907e-05,
170
+ "loss": 2.3167,
171
+ "step": 22
172
+ },
173
+ {
174
+ "epoch": 0.030666666666666665,
175
+ "grad_norm": 0.4599504691462346,
176
+ "learning_rate": 2.9999726270799325e-05,
177
+ "loss": 2.36,
178
+ "step": 23
179
+ },
180
+ {
181
+ "epoch": 0.032,
182
+ "grad_norm": 1.0007538870075432,
183
+ "learning_rate": 2.999951337158897e-05,
184
+ "loss": 2.2342,
185
+ "step": 24
186
+ },
187
+ {
188
+ "epoch": 0.03333333333333333,
189
+ "grad_norm": 0.8436818785690442,
190
+ "learning_rate": 2.9999239645677304e-05,
191
+ "loss": 2.2803,
192
+ "step": 25
193
+ },
194
+ {
195
+ "epoch": 0.034666666666666665,
196
+ "grad_norm": 0.37337797248642607,
197
+ "learning_rate": 2.9998905094297686e-05,
198
+ "loss": 2.188,
199
+ "step": 26
200
+ },
201
+ {
202
+ "epoch": 0.036,
203
+ "grad_norm": 0.16596641795830927,
204
+ "learning_rate": 2.9998509718957563e-05,
205
+ "loss": 2.105,
206
+ "step": 27
207
+ },
208
+ {
209
+ "epoch": 0.037333333333333336,
210
+ "grad_norm": 0.3163756174860874,
211
+ "learning_rate": 2.9998053521438427e-05,
212
+ "loss": 2.1819,
213
+ "step": 28
214
+ },
215
+ {
216
+ "epoch": 0.03866666666666667,
217
+ "grad_norm": 0.73179742586677,
218
+ "learning_rate": 2.9997536503795834e-05,
219
+ "loss": 2.2281,
220
+ "step": 29
221
+ },
222
+ {
223
+ "epoch": 0.04,
224
+ "grad_norm": 0.18793418314598326,
225
+ "learning_rate": 2.9996958668359386e-05,
226
+ "loss": 2.1174,
227
+ "step": 30
228
+ },
229
+ {
230
+ "epoch": 0.04133333333333333,
231
+ "grad_norm": 0.6403343407451153,
232
+ "learning_rate": 2.999632001773272e-05,
233
+ "loss": 2.3419,
234
+ "step": 31
235
+ },
236
+ {
237
+ "epoch": 0.042666666666666665,
238
+ "grad_norm": 0.2773466547599625,
239
+ "learning_rate": 2.9995620554793495e-05,
240
+ "loss": 1.9779,
241
+ "step": 32
242
+ },
243
+ {
244
+ "epoch": 0.044,
245
+ "grad_norm": 0.42992322954286805,
246
+ "learning_rate": 2.999486028269338e-05,
247
+ "loss": 1.891,
248
+ "step": 33
249
+ },
250
+ {
251
+ "epoch": 0.04533333333333334,
252
+ "grad_norm": 0.3880877285520559,
253
+ "learning_rate": 2.9994039204858043e-05,
254
+ "loss": 1.9488,
255
+ "step": 34
256
+ },
257
+ {
258
+ "epoch": 0.04666666666666667,
259
+ "grad_norm": 0.38995745148913935,
260
+ "learning_rate": 2.999315732498714e-05,
261
+ "loss": 2.5013,
262
+ "step": 35
263
+ },
264
+ {
265
+ "epoch": 0.048,
266
+ "grad_norm": 0.4228621926559084,
267
+ "learning_rate": 2.999221464705427e-05,
268
+ "loss": 2.1356,
269
+ "step": 36
270
+ },
271
+ {
272
+ "epoch": 0.04933333333333333,
273
+ "grad_norm": 0.4663890399818879,
274
+ "learning_rate": 2.9991211175307006e-05,
275
+ "loss": 2.155,
276
+ "step": 37
277
+ },
278
+ {
279
+ "epoch": 0.050666666666666665,
280
+ "grad_norm": 0.5287155786829871,
281
+ "learning_rate": 2.9990146914266826e-05,
282
+ "loss": 1.8496,
283
+ "step": 38
284
+ },
285
+ {
286
+ "epoch": 0.052,
287
+ "grad_norm": 0.33925018851926647,
288
+ "learning_rate": 2.9989021868729135e-05,
289
+ "loss": 2.1315,
290
+ "step": 39
291
+ },
292
+ {
293
+ "epoch": 0.05333333333333334,
294
+ "grad_norm": 0.16522140027488588,
295
+ "learning_rate": 2.99878360437632e-05,
296
+ "loss": 2.0977,
297
+ "step": 40
298
+ },
299
+ {
300
+ "epoch": 0.05466666666666667,
301
+ "grad_norm": 0.6355838754971164,
302
+ "learning_rate": 2.998658944471217e-05,
303
+ "loss": 2.1459,
304
+ "step": 41
305
+ },
306
+ {
307
+ "epoch": 0.056,
308
+ "grad_norm": 0.26447590316147407,
309
+ "learning_rate": 2.9985282077193026e-05,
310
+ "loss": 2.0833,
311
+ "step": 42
312
+ },
313
+ {
314
+ "epoch": 0.05733333333333333,
315
+ "grad_norm": 0.29258668895913753,
316
+ "learning_rate": 2.9983913947096563e-05,
317
+ "loss": 2.2603,
318
+ "step": 43
319
+ },
320
+ {
321
+ "epoch": 0.058666666666666666,
322
+ "grad_norm": 0.2920264557478038,
323
+ "learning_rate": 2.9982485060587357e-05,
324
+ "loss": 1.9828,
325
+ "step": 44
326
+ },
327
+ {
328
+ "epoch": 0.06,
329
+ "grad_norm": 0.3860467745966656,
330
+ "learning_rate": 2.9980995424103748e-05,
331
+ "loss": 2.4214,
332
+ "step": 45
333
+ },
334
+ {
335
+ "epoch": 0.06133333333333333,
336
+ "grad_norm": 0.5757512413804123,
337
+ "learning_rate": 2.9979445044357814e-05,
338
+ "loss": 2.212,
339
+ "step": 46
340
+ },
341
+ {
342
+ "epoch": 0.06266666666666666,
343
+ "grad_norm": 0.21304621427134704,
344
+ "learning_rate": 2.9977833928335316e-05,
345
+ "loss": 1.9271,
346
+ "step": 47
347
+ },
348
+ {
349
+ "epoch": 0.064,
350
+ "grad_norm": 0.3301257012003079,
351
+ "learning_rate": 2.9976162083295694e-05,
352
+ "loss": 2.0472,
353
+ "step": 48
354
+ },
355
+ {
356
+ "epoch": 0.06533333333333333,
357
+ "grad_norm": 0.3008211448973632,
358
+ "learning_rate": 2.9974429516772018e-05,
359
+ "loss": 2.4323,
360
+ "step": 49
361
+ },
362
+ {
363
+ "epoch": 0.06666666666666667,
364
+ "grad_norm": 0.26704445986476444,
365
+ "learning_rate": 2.997263623657097e-05,
366
+ "loss": 2.003,
367
+ "step": 50
368
+ },
369
+ {
370
+ "epoch": 0.068,
371
+ "grad_norm": 0.24343367493238866,
372
+ "learning_rate": 2.9970782250772786e-05,
373
+ "loss": 1.9911,
374
+ "step": 51
375
+ },
376
+ {
377
+ "epoch": 0.06933333333333333,
378
+ "grad_norm": 0.36868774906810614,
379
+ "learning_rate": 2.9968867567731233e-05,
380
+ "loss": 2.3312,
381
+ "step": 52
382
+ },
383
+ {
384
+ "epoch": 0.07066666666666667,
385
+ "grad_norm": 0.4346673091737939,
386
+ "learning_rate": 2.9966892196073583e-05,
387
+ "loss": 2.0461,
388
+ "step": 53
389
+ },
390
+ {
391
+ "epoch": 0.072,
392
+ "grad_norm": 0.27732062938320917,
393
+ "learning_rate": 2.996485614470054e-05,
394
+ "loss": 1.8782,
395
+ "step": 54
396
+ },
397
+ {
398
+ "epoch": 0.07333333333333333,
399
+ "grad_norm": 0.5044098714622329,
400
+ "learning_rate": 2.9962759422786248e-05,
401
+ "loss": 2.3024,
402
+ "step": 55
403
+ },
404
+ {
405
+ "epoch": 0.07466666666666667,
406
+ "grad_norm": 0.35584299953539106,
407
+ "learning_rate": 2.9960602039778196e-05,
408
+ "loss": 1.9019,
409
+ "step": 56
410
+ },
411
+ {
412
+ "epoch": 0.076,
413
+ "grad_norm": 0.25515628575898364,
414
+ "learning_rate": 2.995838400539723e-05,
415
+ "loss": 2.1745,
416
+ "step": 57
417
+ },
418
+ {
419
+ "epoch": 0.07733333333333334,
420
+ "grad_norm": 0.22266755698926835,
421
+ "learning_rate": 2.9956105329637454e-05,
422
+ "loss": 2.3572,
423
+ "step": 58
424
+ },
425
+ {
426
+ "epoch": 0.07866666666666666,
427
+ "grad_norm": 0.23590376658875728,
428
+ "learning_rate": 2.9953766022766228e-05,
429
+ "loss": 2.2639,
430
+ "step": 59
431
+ },
432
+ {
433
+ "epoch": 0.08,
434
+ "grad_norm": 0.19958974114647998,
435
+ "learning_rate": 2.9951366095324108e-05,
436
+ "loss": 2.2983,
437
+ "step": 60
438
+ },
439
+ {
440
+ "epoch": 0.08133333333333333,
441
+ "grad_norm": 0.26170668739549613,
442
+ "learning_rate": 2.994890555812479e-05,
443
+ "loss": 1.7797,
444
+ "step": 61
445
+ },
446
+ {
447
+ "epoch": 0.08266666666666667,
448
+ "grad_norm": 0.1947500638232896,
449
+ "learning_rate": 2.9946384422255074e-05,
450
+ "loss": 2.232,
451
+ "step": 62
452
+ },
453
+ {
454
+ "epoch": 0.084,
455
+ "grad_norm": 0.18048707660964908,
456
+ "learning_rate": 2.9943802699074796e-05,
457
+ "loss": 2.3487,
458
+ "step": 63
459
+ },
460
+ {
461
+ "epoch": 0.08533333333333333,
462
+ "grad_norm": 0.21913225188044025,
463
+ "learning_rate": 2.994116040021681e-05,
464
+ "loss": 2.1964,
465
+ "step": 64
466
+ },
467
+ {
468
+ "epoch": 0.08666666666666667,
469
+ "grad_norm": 0.4242826633242974,
470
+ "learning_rate": 2.9938457537586896e-05,
471
+ "loss": 2.3137,
472
+ "step": 65
473
+ },
474
+ {
475
+ "epoch": 0.088,
476
+ "grad_norm": 0.2557454392806272,
477
+ "learning_rate": 2.9935694123363727e-05,
478
+ "loss": 2.1458,
479
+ "step": 66
480
+ },
481
+ {
482
+ "epoch": 0.08933333333333333,
483
+ "grad_norm": 0.3093808768168103,
484
+ "learning_rate": 2.9932870169998825e-05,
485
+ "loss": 2.1153,
486
+ "step": 67
487
+ },
488
+ {
489
+ "epoch": 0.09066666666666667,
490
+ "grad_norm": 0.33819459397450213,
491
+ "learning_rate": 2.9929985690216478e-05,
492
+ "loss": 2.0772,
493
+ "step": 68
494
+ },
495
+ {
496
+ "epoch": 0.092,
497
+ "grad_norm": 0.21377958208808176,
498
+ "learning_rate": 2.9927040697013705e-05,
499
+ "loss": 2.1949,
500
+ "step": 69
501
+ },
502
+ {
503
+ "epoch": 0.09333333333333334,
504
+ "grad_norm": 0.2652218777029591,
505
+ "learning_rate": 2.9924035203660188e-05,
506
+ "loss": 2.0031,
507
+ "step": 70
508
+ },
509
+ {
510
+ "epoch": 0.09466666666666666,
511
+ "grad_norm": 0.23574741210798578,
512
+ "learning_rate": 2.9920969223698202e-05,
513
+ "loss": 1.9449,
514
+ "step": 71
515
+ },
516
+ {
517
+ "epoch": 0.096,
518
+ "grad_norm": 0.2226926959097067,
519
+ "learning_rate": 2.991784277094258e-05,
520
+ "loss": 2.031,
521
+ "step": 72
522
+ },
523
+ {
524
+ "epoch": 0.09733333333333333,
525
+ "grad_norm": 0.24814250998530846,
526
+ "learning_rate": 2.9914655859480632e-05,
527
+ "loss": 2.0921,
528
+ "step": 73
529
+ },
530
+ {
531
+ "epoch": 0.09866666666666667,
532
+ "grad_norm": 0.33914254917033065,
533
+ "learning_rate": 2.991140850367208e-05,
534
+ "loss": 2.2183,
535
+ "step": 74
536
+ },
537
+ {
538
+ "epoch": 0.1,
539
+ "grad_norm": 0.25867270370266443,
540
+ "learning_rate": 2.990810071814901e-05,
541
+ "loss": 1.6416,
542
+ "step": 75
543
+ },
544
+ {
545
+ "epoch": 0.1,
546
+ "eval_loss": 1.782590389251709,
547
+ "eval_runtime": 98.6583,
548
+ "eval_samples_per_second": 1.014,
549
+ "eval_steps_per_second": 0.253,
550
+ "step": 75
551
+ },
552
+ {
553
+ "epoch": 0.10133333333333333,
554
+ "grad_norm": 0.22920181089355457,
555
+ "learning_rate": 2.990473251781578e-05,
556
+ "loss": 2.1863,
557
+ "step": 76
558
+ },
559
+ {
560
+ "epoch": 0.10266666666666667,
561
+ "grad_norm": 0.18274633553603156,
562
+ "learning_rate": 2.9901303917848977e-05,
563
+ "loss": 2.2399,
564
+ "step": 77
565
+ },
566
+ {
567
+ "epoch": 0.104,
568
+ "grad_norm": 0.18220877055008589,
569
+ "learning_rate": 2.9897814933697335e-05,
570
+ "loss": 1.9149,
571
+ "step": 78
572
+ },
573
+ {
574
+ "epoch": 0.10533333333333333,
575
+ "grad_norm": 0.2021601886002538,
576
+ "learning_rate": 2.9894265581081682e-05,
577
+ "loss": 2.199,
578
+ "step": 79
579
+ },
580
+ {
581
+ "epoch": 0.10666666666666667,
582
+ "grad_norm": 3.7616720925364286,
583
+ "learning_rate": 2.989065587599484e-05,
584
+ "loss": 2.1179,
585
+ "step": 80
586
+ },
587
+ {
588
+ "epoch": 0.108,
589
+ "grad_norm": 0.23407462358542114,
590
+ "learning_rate": 2.9886985834701577e-05,
591
+ "loss": 2.5661,
592
+ "step": 81
593
+ },
594
+ {
595
+ "epoch": 0.10933333333333334,
596
+ "grad_norm": 1.5338568979686038,
597
+ "learning_rate": 2.9883255473738523e-05,
598
+ "loss": 2.2263,
599
+ "step": 82
600
+ },
601
+ {
602
+ "epoch": 0.11066666666666666,
603
+ "grad_norm": 0.24500727274505268,
604
+ "learning_rate": 2.9879464809914113e-05,
605
+ "loss": 2.0621,
606
+ "step": 83
607
+ },
608
+ {
609
+ "epoch": 0.112,
610
+ "grad_norm": 0.21179059127462146,
611
+ "learning_rate": 2.987561386030848e-05,
612
+ "loss": 2.0356,
613
+ "step": 84
614
+ },
615
+ {
616
+ "epoch": 0.11333333333333333,
617
+ "grad_norm": 0.3306008087329227,
618
+ "learning_rate": 2.9871702642273404e-05,
619
+ "loss": 2.0848,
620
+ "step": 85
621
+ },
622
+ {
623
+ "epoch": 0.11466666666666667,
624
+ "grad_norm": 0.15657087812683942,
625
+ "learning_rate": 2.9867731173432215e-05,
626
+ "loss": 2.2127,
627
+ "step": 86
628
+ },
629
+ {
630
+ "epoch": 0.116,
631
+ "grad_norm": 0.1886500437129195,
632
+ "learning_rate": 2.9863699471679743e-05,
633
+ "loss": 1.9854,
634
+ "step": 87
635
+ },
636
+ {
637
+ "epoch": 0.11733333333333333,
638
+ "grad_norm": 0.22974349537556282,
639
+ "learning_rate": 2.9859607555182206e-05,
640
+ "loss": 2.1972,
641
+ "step": 88
642
+ },
643
+ {
644
+ "epoch": 0.11866666666666667,
645
+ "grad_norm": 0.2578002903522265,
646
+ "learning_rate": 2.9855455442377135e-05,
647
+ "loss": 2.1804,
648
+ "step": 89
649
+ },
650
+ {
651
+ "epoch": 0.12,
652
+ "grad_norm": 0.24109737247334548,
653
+ "learning_rate": 2.9851243151973314e-05,
654
+ "loss": 1.9523,
655
+ "step": 90
656
+ },
657
+ {
658
+ "epoch": 0.12133333333333333,
659
+ "grad_norm": 0.34121465203098633,
660
+ "learning_rate": 2.9846970702950653e-05,
661
+ "loss": 1.9493,
662
+ "step": 91
663
+ },
664
+ {
665
+ "epoch": 0.12266666666666666,
666
+ "grad_norm": 0.2069391899522312,
667
+ "learning_rate": 2.9842638114560144e-05,
668
+ "loss": 2.2207,
669
+ "step": 92
670
+ },
671
+ {
672
+ "epoch": 0.124,
673
+ "grad_norm": 0.41038744334249916,
674
+ "learning_rate": 2.9838245406323763e-05,
675
+ "loss": 2.0954,
676
+ "step": 93
677
+ },
678
+ {
679
+ "epoch": 0.12533333333333332,
680
+ "grad_norm": 0.2818876029106677,
681
+ "learning_rate": 2.9833792598034362e-05,
682
+ "loss": 2.063,
683
+ "step": 94
684
+ },
685
+ {
686
+ "epoch": 0.12666666666666668,
687
+ "grad_norm": 0.3963506275688602,
688
+ "learning_rate": 2.9829279709755597e-05,
689
+ "loss": 1.9915,
690
+ "step": 95
691
+ },
692
+ {
693
+ "epoch": 0.128,
694
+ "grad_norm": 0.3357399751778237,
695
+ "learning_rate": 2.9824706761821845e-05,
696
+ "loss": 2.0319,
697
+ "step": 96
698
+ },
699
+ {
700
+ "epoch": 0.12933333333333333,
701
+ "grad_norm": 0.2484694376564558,
702
+ "learning_rate": 2.9820073774838092e-05,
703
+ "loss": 1.9593,
704
+ "step": 97
705
+ },
706
+ {
707
+ "epoch": 0.13066666666666665,
708
+ "grad_norm": 0.28583010649806945,
709
+ "learning_rate": 2.9815380769679853e-05,
710
+ "loss": 2.0876,
711
+ "step": 98
712
+ },
713
+ {
714
+ "epoch": 0.132,
715
+ "grad_norm": 0.2839808516235786,
716
+ "learning_rate": 2.9810627767493083e-05,
717
+ "loss": 2.107,
718
+ "step": 99
719
+ },
720
+ {
721
+ "epoch": 0.13333333333333333,
722
+ "grad_norm": 0.22309498057469626,
723
+ "learning_rate": 2.9805814789694065e-05,
724
+ "loss": 2.3163,
725
+ "step": 100
726
+ },
727
+ {
728
+ "epoch": 0.13466666666666666,
729
+ "grad_norm": 0.21629704903995522,
730
+ "learning_rate": 2.9800941857969325e-05,
731
+ "loss": 1.9568,
732
+ "step": 101
733
+ },
734
+ {
735
+ "epoch": 0.136,
736
+ "grad_norm": 0.2498232144819958,
737
+ "learning_rate": 2.9796008994275533e-05,
738
+ "loss": 2.2436,
739
+ "step": 102
740
+ },
741
+ {
742
+ "epoch": 0.13733333333333334,
743
+ "grad_norm": 0.23251794806618262,
744
+ "learning_rate": 2.979101622083941e-05,
745
+ "loss": 2.1077,
746
+ "step": 103
747
+ },
748
+ {
749
+ "epoch": 0.13866666666666666,
750
+ "grad_norm": 0.27094215165929786,
751
+ "learning_rate": 2.978596356015761e-05,
752
+ "loss": 2.2429,
753
+ "step": 104
754
+ },
755
+ {
756
+ "epoch": 0.14,
757
+ "grad_norm": 0.23381976958310982,
758
+ "learning_rate": 2.978085103499663e-05,
759
+ "loss": 2.1045,
760
+ "step": 105
761
+ },
762
+ {
763
+ "epoch": 0.14133333333333334,
764
+ "grad_norm": 0.31137413672950875,
765
+ "learning_rate": 2.9775678668392713e-05,
766
+ "loss": 2.2828,
767
+ "step": 106
768
+ },
769
+ {
770
+ "epoch": 0.14266666666666666,
771
+ "grad_norm": 0.2510514954722372,
772
+ "learning_rate": 2.9770446483651735e-05,
773
+ "loss": 2.0488,
774
+ "step": 107
775
+ },
776
+ {
777
+ "epoch": 0.144,
778
+ "grad_norm": 0.20677625084575546,
779
+ "learning_rate": 2.976515450434911e-05,
780
+ "loss": 2.0007,
781
+ "step": 108
782
+ },
783
+ {
784
+ "epoch": 0.14533333333333334,
785
+ "grad_norm": 0.1441931078695141,
786
+ "learning_rate": 2.9759802754329665e-05,
787
+ "loss": 2.2914,
788
+ "step": 109
789
+ },
790
+ {
791
+ "epoch": 0.14666666666666667,
792
+ "grad_norm": 0.2505777324570115,
793
+ "learning_rate": 2.9754391257707555e-05,
794
+ "loss": 2.3766,
795
+ "step": 110
796
+ },
797
+ {
798
+ "epoch": 0.148,
799
+ "grad_norm": 0.15292718959769924,
800
+ "learning_rate": 2.9748920038866134e-05,
801
+ "loss": 2.0941,
802
+ "step": 111
803
+ },
804
+ {
805
+ "epoch": 0.14933333333333335,
806
+ "grad_norm": 0.323283263320876,
807
+ "learning_rate": 2.9743389122457864e-05,
808
+ "loss": 2.249,
809
+ "step": 112
810
+ },
811
+ {
812
+ "epoch": 0.15066666666666667,
813
+ "grad_norm": 0.18122528780601804,
814
+ "learning_rate": 2.9737798533404195e-05,
815
+ "loss": 2.225,
816
+ "step": 113
817
+ },
818
+ {
819
+ "epoch": 0.152,
820
+ "grad_norm": 0.1959413592260722,
821
+ "learning_rate": 2.9732148296895444e-05,
822
+ "loss": 2.2521,
823
+ "step": 114
824
+ },
825
+ {
826
+ "epoch": 0.15333333333333332,
827
+ "grad_norm": 0.3035240193818014,
828
+ "learning_rate": 2.9726438438390702e-05,
829
+ "loss": 1.8693,
830
+ "step": 115
831
+ },
832
+ {
833
+ "epoch": 0.15466666666666667,
834
+ "grad_norm": 0.18185372954938864,
835
+ "learning_rate": 2.9720668983617685e-05,
836
+ "loss": 1.7352,
837
+ "step": 116
838
+ },
839
+ {
840
+ "epoch": 0.156,
841
+ "grad_norm": 0.22324589503263917,
842
+ "learning_rate": 2.9714839958572674e-05,
843
+ "loss": 2.1327,
844
+ "step": 117
845
+ },
846
+ {
847
+ "epoch": 0.15733333333333333,
848
+ "grad_norm": 0.17640470731953065,
849
+ "learning_rate": 2.9708951389520338e-05,
850
+ "loss": 2.3011,
851
+ "step": 118
852
+ },
853
+ {
854
+ "epoch": 0.15866666666666668,
855
+ "grad_norm": 0.22566777150432568,
856
+ "learning_rate": 2.970300330299365e-05,
857
+ "loss": 2.2591,
858
+ "step": 119
859
+ },
860
+ {
861
+ "epoch": 0.16,
862
+ "grad_norm": 0.26753575405847074,
863
+ "learning_rate": 2.9696995725793764e-05,
864
+ "loss": 1.9764,
865
+ "step": 120
866
+ },
867
+ {
868
+ "epoch": 0.16133333333333333,
869
+ "grad_norm": 0.2625838791216609,
870
+ "learning_rate": 2.969092868498988e-05,
871
+ "loss": 2.0906,
872
+ "step": 121
873
+ },
874
+ {
875
+ "epoch": 0.16266666666666665,
876
+ "grad_norm": 0.23542440806071088,
877
+ "learning_rate": 2.9684802207919144e-05,
878
+ "loss": 2.331,
879
+ "step": 122
880
+ },
881
+ {
882
+ "epoch": 0.164,
883
+ "grad_norm": 0.16614596830245607,
884
+ "learning_rate": 2.9678616322186506e-05,
885
+ "loss": 2.0534,
886
+ "step": 123
887
+ },
888
+ {
889
+ "epoch": 0.16533333333333333,
890
+ "grad_norm": 0.2087450409416601,
891
+ "learning_rate": 2.9672371055664598e-05,
892
+ "loss": 2.3241,
893
+ "step": 124
894
+ },
895
+ {
896
+ "epoch": 0.16666666666666666,
897
+ "grad_norm": 0.19314135153093137,
898
+ "learning_rate": 2.9666066436493612e-05,
899
+ "loss": 2.0654,
900
+ "step": 125
901
+ },
902
+ {
903
+ "epoch": 0.168,
904
+ "grad_norm": 0.21431158762567906,
905
+ "learning_rate": 2.9659702493081184e-05,
906
+ "loss": 1.9973,
907
+ "step": 126
908
+ },
909
+ {
910
+ "epoch": 0.16933333333333334,
911
+ "grad_norm": 1.9808549742241428,
912
+ "learning_rate": 2.965327925410226e-05,
913
+ "loss": 2.0077,
914
+ "step": 127
915
+ },
916
+ {
917
+ "epoch": 0.17066666666666666,
918
+ "grad_norm": 0.26139018103307665,
919
+ "learning_rate": 2.9646796748498934e-05,
920
+ "loss": 2.0224,
921
+ "step": 128
922
+ },
923
+ {
924
+ "epoch": 0.172,
925
+ "grad_norm": 0.25005627301032396,
926
+ "learning_rate": 2.9640255005480376e-05,
927
+ "loss": 2.2725,
928
+ "step": 129
929
+ },
930
+ {
931
+ "epoch": 0.17333333333333334,
932
+ "grad_norm": 0.15880416331686953,
933
+ "learning_rate": 2.9633654054522655e-05,
934
+ "loss": 2.2907,
935
+ "step": 130
936
+ },
937
+ {
938
+ "epoch": 0.17466666666666666,
939
+ "grad_norm": 0.33529634494995547,
940
+ "learning_rate": 2.9626993925368635e-05,
941
+ "loss": 2.0113,
942
+ "step": 131
943
+ },
944
+ {
945
+ "epoch": 0.176,
946
+ "grad_norm": 0.28246655756719385,
947
+ "learning_rate": 2.9620274648027805e-05,
948
+ "loss": 1.6602,
949
+ "step": 132
950
+ },
951
+ {
952
+ "epoch": 0.17733333333333334,
953
+ "grad_norm": 0.19752740198982274,
954
+ "learning_rate": 2.96134962527762e-05,
955
+ "loss": 2.4049,
956
+ "step": 133
957
+ },
958
+ {
959
+ "epoch": 0.17866666666666667,
960
+ "grad_norm": 0.24132626803001633,
961
+ "learning_rate": 2.960665877015619e-05,
962
+ "loss": 1.9882,
963
+ "step": 134
964
+ },
965
+ {
966
+ "epoch": 0.18,
967
+ "grad_norm": 0.20387262050547844,
968
+ "learning_rate": 2.959976223097642e-05,
969
+ "loss": 2.1055,
970
+ "step": 135
971
+ },
972
+ {
973
+ "epoch": 0.18133333333333335,
974
+ "grad_norm": 0.1624504592181784,
975
+ "learning_rate": 2.9592806666311612e-05,
976
+ "loss": 2.0519,
977
+ "step": 136
978
+ },
979
+ {
980
+ "epoch": 0.18266666666666667,
981
+ "grad_norm": 0.17429829196685362,
982
+ "learning_rate": 2.958579210750246e-05,
983
+ "loss": 2.2465,
984
+ "step": 137
985
+ },
986
+ {
987
+ "epoch": 0.184,
988
+ "grad_norm": 0.20450521945125338,
989
+ "learning_rate": 2.9578718586155467e-05,
990
+ "loss": 2.0122,
991
+ "step": 138
992
+ },
993
+ {
994
+ "epoch": 0.18533333333333332,
995
+ "grad_norm": 0.25553899821370907,
996
+ "learning_rate": 2.9571586134142824e-05,
997
+ "loss": 1.9806,
998
+ "step": 139
999
+ },
1000
+ {
1001
+ "epoch": 0.18666666666666668,
1002
+ "grad_norm": 0.18251828378522442,
1003
+ "learning_rate": 2.956439478360224e-05,
1004
+ "loss": 2.0153,
1005
+ "step": 140
1006
+ },
1007
+ {
1008
+ "epoch": 0.188,
1009
+ "grad_norm": 0.27047669211119385,
1010
+ "learning_rate": 2.9557144566936813e-05,
1011
+ "loss": 2.3386,
1012
+ "step": 141
1013
+ },
1014
+ {
1015
+ "epoch": 0.18933333333333333,
1016
+ "grad_norm": 0.25054213266869674,
1017
+ "learning_rate": 2.9549835516814905e-05,
1018
+ "loss": 1.8871,
1019
+ "step": 142
1020
+ },
1021
+ {
1022
+ "epoch": 0.19066666666666668,
1023
+ "grad_norm": 0.2177646465341283,
1024
+ "learning_rate": 2.9542467666169946e-05,
1025
+ "loss": 1.9861,
1026
+ "step": 143
1027
+ },
1028
+ {
1029
+ "epoch": 0.192,
1030
+ "grad_norm": 0.23131873734472552,
1031
+ "learning_rate": 2.953504104820032e-05,
1032
+ "loss": 2.2903,
1033
+ "step": 144
1034
+ },
1035
+ {
1036
+ "epoch": 0.19333333333333333,
1037
+ "grad_norm": 0.19112449587530203,
1038
+ "learning_rate": 2.9527555696369217e-05,
1039
+ "loss": 2.079,
1040
+ "step": 145
1041
+ },
1042
+ {
1043
+ "epoch": 0.19466666666666665,
1044
+ "grad_norm": 0.25269202301867255,
1045
+ "learning_rate": 2.9520011644404457e-05,
1046
+ "loss": 2.0033,
1047
+ "step": 146
1048
+ },
1049
+ {
1050
+ "epoch": 0.196,
1051
+ "grad_norm": 0.27870777753889076,
1052
+ "learning_rate": 2.9512408926298362e-05,
1053
+ "loss": 2.0747,
1054
+ "step": 147
1055
+ },
1056
+ {
1057
+ "epoch": 0.19733333333333333,
1058
+ "grad_norm": 0.2356849572111222,
1059
+ "learning_rate": 2.9504747576307594e-05,
1060
+ "loss": 1.9859,
1061
+ "step": 148
1062
+ },
1063
+ {
1064
+ "epoch": 0.19866666666666666,
1065
+ "grad_norm": 0.24802404030831268,
1066
+ "learning_rate": 2.9497027628953e-05,
1067
+ "loss": 1.9636,
1068
+ "step": 149
1069
+ },
1070
+ {
1071
+ "epoch": 0.2,
1072
+ "grad_norm": 0.2705817304080749,
1073
+ "learning_rate": 2.9489249119019465e-05,
1074
+ "loss": 2.3547,
1075
+ "step": 150
1076
+ },
1077
+ {
1078
+ "epoch": 0.2,
1079
+ "eval_loss": 1.764258623123169,
1080
+ "eval_runtime": 98.8204,
1081
+ "eval_samples_per_second": 1.012,
1082
+ "eval_steps_per_second": 0.253,
1083
+ "step": 150
1084
+ },
1085
+ {
1086
+ "epoch": 0.20133333333333334,
1087
+ "grad_norm": 0.2582615198348039,
1088
+ "learning_rate": 2.948141208155574e-05,
1089
+ "loss": 2.0245,
1090
+ "step": 151
1091
+ },
1092
+ {
1093
+ "epoch": 0.20266666666666666,
1094
+ "grad_norm": 0.4907789285423073,
1095
+ "learning_rate": 2.9473516551874283e-05,
1096
+ "loss": 2.0994,
1097
+ "step": 152
1098
+ },
1099
+ {
1100
+ "epoch": 0.204,
1101
+ "grad_norm": 0.18922034328138926,
1102
+ "learning_rate": 2.946556256555113e-05,
1103
+ "loss": 2.323,
1104
+ "step": 153
1105
+ },
1106
+ {
1107
+ "epoch": 0.20533333333333334,
1108
+ "grad_norm": 0.2033901054041812,
1109
+ "learning_rate": 2.94575501584257e-05,
1110
+ "loss": 2.0412,
1111
+ "step": 154
1112
+ },
1113
+ {
1114
+ "epoch": 0.20666666666666667,
1115
+ "grad_norm": 0.44181703724009636,
1116
+ "learning_rate": 2.9449479366600646e-05,
1117
+ "loss": 2.0977,
1118
+ "step": 155
1119
+ },
1120
+ {
1121
+ "epoch": 0.208,
1122
+ "grad_norm": 0.23299679001877052,
1123
+ "learning_rate": 2.94413502264417e-05,
1124
+ "loss": 2.1351,
1125
+ "step": 156
1126
+ },
1127
+ {
1128
+ "epoch": 0.20933333333333334,
1129
+ "grad_norm": 0.1826511129435199,
1130
+ "learning_rate": 2.94331627745775e-05,
1131
+ "loss": 2.23,
1132
+ "step": 157
1133
+ },
1134
+ {
1135
+ "epoch": 0.21066666666666667,
1136
+ "grad_norm": 0.21212768867139842,
1137
+ "learning_rate": 2.9424917047899425e-05,
1138
+ "loss": 2.1412,
1139
+ "step": 158
1140
+ },
1141
+ {
1142
+ "epoch": 0.212,
1143
+ "grad_norm": 0.3826582810934168,
1144
+ "learning_rate": 2.9416613083561428e-05,
1145
+ "loss": 2.2128,
1146
+ "step": 159
1147
+ },
1148
+ {
1149
+ "epoch": 0.21333333333333335,
1150
+ "grad_norm": 0.19962942232450726,
1151
+ "learning_rate": 2.9408250918979886e-05,
1152
+ "loss": 1.9778,
1153
+ "step": 160
1154
+ },
1155
+ {
1156
+ "epoch": 0.21466666666666667,
1157
+ "grad_norm": 0.1694294631609089,
1158
+ "learning_rate": 2.9399830591833407e-05,
1159
+ "loss": 2.4602,
1160
+ "step": 161
1161
+ },
1162
+ {
1163
+ "epoch": 0.216,
1164
+ "grad_norm": 0.2899013004381108,
1165
+ "learning_rate": 2.9391352140062668e-05,
1166
+ "loss": 1.9118,
1167
+ "step": 162
1168
+ },
1169
+ {
1170
+ "epoch": 0.21733333333333332,
1171
+ "grad_norm": 0.23107679488781446,
1172
+ "learning_rate": 2.9382815601870252e-05,
1173
+ "loss": 2.0482,
1174
+ "step": 163
1175
+ },
1176
+ {
1177
+ "epoch": 0.21866666666666668,
1178
+ "grad_norm": 0.21890406305674734,
1179
+ "learning_rate": 2.9374221015720465e-05,
1180
+ "loss": 2.041,
1181
+ "step": 164
1182
+ },
1183
+ {
1184
+ "epoch": 0.22,
1185
+ "grad_norm": 0.23483748269482313,
1186
+ "learning_rate": 2.9365568420339173e-05,
1187
+ "loss": 2.2775,
1188
+ "step": 165
1189
+ },
1190
+ {
1191
+ "epoch": 0.22133333333333333,
1192
+ "grad_norm": 0.2872971245253849,
1193
+ "learning_rate": 2.9356857854713628e-05,
1194
+ "loss": 1.9568,
1195
+ "step": 166
1196
+ },
1197
+ {
1198
+ "epoch": 0.22266666666666668,
1199
+ "grad_norm": 0.2372348840011155,
1200
+ "learning_rate": 2.9348089358092266e-05,
1201
+ "loss": 2.262,
1202
+ "step": 167
1203
+ },
1204
+ {
1205
+ "epoch": 0.224,
1206
+ "grad_norm": 0.36284791627622115,
1207
+ "learning_rate": 2.9339262969984575e-05,
1208
+ "loss": 2.083,
1209
+ "step": 168
1210
+ },
1211
+ {
1212
+ "epoch": 0.22533333333333333,
1213
+ "grad_norm": 0.29336818899957795,
1214
+ "learning_rate": 2.9330378730160882e-05,
1215
+ "loss": 1.9668,
1216
+ "step": 169
1217
+ },
1218
+ {
1219
+ "epoch": 0.22666666666666666,
1220
+ "grad_norm": 0.2201337837382957,
1221
+ "learning_rate": 2.932143667865218e-05,
1222
+ "loss": 2.2411,
1223
+ "step": 170
1224
+ },
1225
+ {
1226
+ "epoch": 0.228,
1227
+ "grad_norm": 0.23312129174527574,
1228
+ "learning_rate": 2.931243685574997e-05,
1229
+ "loss": 1.7557,
1230
+ "step": 171
1231
+ },
1232
+ {
1233
+ "epoch": 0.22933333333333333,
1234
+ "grad_norm": 0.2158943628320117,
1235
+ "learning_rate": 2.930337930200603e-05,
1236
+ "loss": 2.2391,
1237
+ "step": 172
1238
+ },
1239
+ {
1240
+ "epoch": 0.23066666666666666,
1241
+ "grad_norm": 0.22961082626415702,
1242
+ "learning_rate": 2.929426405823231e-05,
1243
+ "loss": 1.983,
1244
+ "step": 173
1245
+ },
1246
+ {
1247
+ "epoch": 0.232,
1248
+ "grad_norm": 0.1589890947108339,
1249
+ "learning_rate": 2.9285091165500653e-05,
1250
+ "loss": 2.2212,
1251
+ "step": 174
1252
+ },
1253
+ {
1254
+ "epoch": 0.23333333333333334,
1255
+ "grad_norm": 0.3416891731513133,
1256
+ "learning_rate": 2.9275860665142697e-05,
1257
+ "loss": 2.151,
1258
+ "step": 175
1259
+ },
1260
+ {
1261
+ "epoch": 0.23466666666666666,
1262
+ "grad_norm": 0.3116792178273567,
1263
+ "learning_rate": 2.9266572598749632e-05,
1264
+ "loss": 1.9779,
1265
+ "step": 176
1266
+ },
1267
+ {
1268
+ "epoch": 0.236,
1269
+ "grad_norm": 0.2507975453340776,
1270
+ "learning_rate": 2.925722700817204e-05,
1271
+ "loss": 1.8828,
1272
+ "step": 177
1273
+ },
1274
+ {
1275
+ "epoch": 0.23733333333333334,
1276
+ "grad_norm": 0.2679656189407352,
1277
+ "learning_rate": 2.9247823935519685e-05,
1278
+ "loss": 2.0387,
1279
+ "step": 178
1280
+ },
1281
+ {
1282
+ "epoch": 0.23866666666666667,
1283
+ "grad_norm": 0.23989994031916193,
1284
+ "learning_rate": 2.9238363423161357e-05,
1285
+ "loss": 2.0543,
1286
+ "step": 179
1287
+ },
1288
+ {
1289
+ "epoch": 0.24,
1290
+ "grad_norm": 0.1858192959656248,
1291
+ "learning_rate": 2.9228845513724636e-05,
1292
+ "loss": 2.1298,
1293
+ "step": 180
1294
+ },
1295
+ {
1296
+ "epoch": 0.24133333333333334,
1297
+ "grad_norm": 0.2634790646951208,
1298
+ "learning_rate": 2.921927025009575e-05,
1299
+ "loss": 1.8913,
1300
+ "step": 181
1301
+ },
1302
+ {
1303
+ "epoch": 0.24266666666666667,
1304
+ "grad_norm": 0.2022924360136422,
1305
+ "learning_rate": 2.920963767541933e-05,
1306
+ "loss": 2.125,
1307
+ "step": 182
1308
+ },
1309
+ {
1310
+ "epoch": 0.244,
1311
+ "grad_norm": 0.2093674676298739,
1312
+ "learning_rate": 2.919994783309827e-05,
1313
+ "loss": 2.0787,
1314
+ "step": 183
1315
+ },
1316
+ {
1317
+ "epoch": 0.24533333333333332,
1318
+ "grad_norm": 0.21965402179914853,
1319
+ "learning_rate": 2.9190200766793476e-05,
1320
+ "loss": 2.2001,
1321
+ "step": 184
1322
+ },
1323
+ {
1324
+ "epoch": 0.24666666666666667,
1325
+ "grad_norm": 0.1853707607114335,
1326
+ "learning_rate": 2.9180396520423712e-05,
1327
+ "loss": 2.2625,
1328
+ "step": 185
1329
+ },
1330
+ {
1331
+ "epoch": 0.248,
1332
+ "grad_norm": 0.16161978911997635,
1333
+ "learning_rate": 2.9170535138165386e-05,
1334
+ "loss": 2.1841,
1335
+ "step": 186
1336
+ },
1337
+ {
1338
+ "epoch": 0.24933333333333332,
1339
+ "grad_norm": 0.3229805512119213,
1340
+ "learning_rate": 2.9160616664452343e-05,
1341
+ "loss": 2.3706,
1342
+ "step": 187
1343
+ },
1344
+ {
1345
+ "epoch": 0.25066666666666665,
1346
+ "grad_norm": 0.2836753076906196,
1347
+ "learning_rate": 2.915064114397568e-05,
1348
+ "loss": 1.6237,
1349
+ "step": 188
1350
+ },
1351
+ {
1352
+ "epoch": 0.252,
1353
+ "grad_norm": 0.2698172260057232,
1354
+ "learning_rate": 2.9140608621683537e-05,
1355
+ "loss": 1.6731,
1356
+ "step": 189
1357
+ },
1358
+ {
1359
+ "epoch": 0.25333333333333335,
1360
+ "grad_norm": 0.19227100326585553,
1361
+ "learning_rate": 2.913051914278089e-05,
1362
+ "loss": 2.3641,
1363
+ "step": 190
1364
+ },
1365
+ {
1366
+ "epoch": 0.25466666666666665,
1367
+ "grad_norm": 0.17075536856147475,
1368
+ "learning_rate": 2.9120372752729364e-05,
1369
+ "loss": 2.1405,
1370
+ "step": 191
1371
+ },
1372
+ {
1373
+ "epoch": 0.256,
1374
+ "grad_norm": 0.18977706986071732,
1375
+ "learning_rate": 2.9110169497247005e-05,
1376
+ "loss": 2.2104,
1377
+ "step": 192
1378
+ },
1379
+ {
1380
+ "epoch": 0.25733333333333336,
1381
+ "grad_norm": 0.19137759329665405,
1382
+ "learning_rate": 2.909990942230809e-05,
1383
+ "loss": 2.0225,
1384
+ "step": 193
1385
+ },
1386
+ {
1387
+ "epoch": 0.25866666666666666,
1388
+ "grad_norm": 1.565607637042905,
1389
+ "learning_rate": 2.9089592574142925e-05,
1390
+ "loss": 1.8544,
1391
+ "step": 194
1392
+ },
1393
+ {
1394
+ "epoch": 0.26,
1395
+ "grad_norm": 0.19540116729751783,
1396
+ "learning_rate": 2.9079218999237602e-05,
1397
+ "loss": 2.3062,
1398
+ "step": 195
1399
+ },
1400
+ {
1401
+ "epoch": 0.2613333333333333,
1402
+ "grad_norm": 0.210417328406617,
1403
+ "learning_rate": 2.9068788744333847e-05,
1404
+ "loss": 2.1931,
1405
+ "step": 196
1406
+ },
1407
+ {
1408
+ "epoch": 0.26266666666666666,
1409
+ "grad_norm": 0.17528914661443556,
1410
+ "learning_rate": 2.905830185642875e-05,
1411
+ "loss": 2.213,
1412
+ "step": 197
1413
+ },
1414
+ {
1415
+ "epoch": 0.264,
1416
+ "grad_norm": 0.19180797982535258,
1417
+ "learning_rate": 2.90477583827746e-05,
1418
+ "loss": 2.0999,
1419
+ "step": 198
1420
+ },
1421
+ {
1422
+ "epoch": 0.2653333333333333,
1423
+ "grad_norm": 0.20387611359379645,
1424
+ "learning_rate": 2.903715837087864e-05,
1425
+ "loss": 2.4077,
1426
+ "step": 199
1427
+ },
1428
+ {
1429
+ "epoch": 0.26666666666666666,
1430
+ "grad_norm": 0.37614141580385196,
1431
+ "learning_rate": 2.9026501868502878e-05,
1432
+ "loss": 2.345,
1433
+ "step": 200
1434
+ },
1435
+ {
1436
+ "epoch": 0.268,
1437
+ "grad_norm": 0.23854637338124732,
1438
+ "learning_rate": 2.901578892366384e-05,
1439
+ "loss": 1.9616,
1440
+ "step": 201
1441
+ },
1442
+ {
1443
+ "epoch": 0.2693333333333333,
1444
+ "grad_norm": 0.360995797147823,
1445
+ "learning_rate": 2.9005019584632385e-05,
1446
+ "loss": 2.2637,
1447
+ "step": 202
1448
+ },
1449
+ {
1450
+ "epoch": 0.27066666666666667,
1451
+ "grad_norm": 0.1964827412506095,
1452
+ "learning_rate": 2.899419389993348e-05,
1453
+ "loss": 2.2153,
1454
+ "step": 203
1455
+ },
1456
+ {
1457
+ "epoch": 0.272,
1458
+ "grad_norm": 0.2697880949391648,
1459
+ "learning_rate": 2.8983311918345973e-05,
1460
+ "loss": 1.8934,
1461
+ "step": 204
1462
+ },
1463
+ {
1464
+ "epoch": 0.2733333333333333,
1465
+ "grad_norm": 0.2583392197903811,
1466
+ "learning_rate": 2.8972373688902372e-05,
1467
+ "loss": 2.1552,
1468
+ "step": 205
1469
+ },
1470
+ {
1471
+ "epoch": 0.27466666666666667,
1472
+ "grad_norm": 0.2368677594314249,
1473
+ "learning_rate": 2.8961379260888634e-05,
1474
+ "loss": 2.1976,
1475
+ "step": 206
1476
+ },
1477
+ {
1478
+ "epoch": 0.276,
1479
+ "grad_norm": 0.28279002863897235,
1480
+ "learning_rate": 2.895032868384393e-05,
1481
+ "loss": 2.0461,
1482
+ "step": 207
1483
+ },
1484
+ {
1485
+ "epoch": 0.2773333333333333,
1486
+ "grad_norm": 0.17431186704640889,
1487
+ "learning_rate": 2.8939222007560446e-05,
1488
+ "loss": 2.2258,
1489
+ "step": 208
1490
+ },
1491
+ {
1492
+ "epoch": 0.2786666666666667,
1493
+ "grad_norm": 0.2607071149284682,
1494
+ "learning_rate": 2.8928059282083126e-05,
1495
+ "loss": 1.8783,
1496
+ "step": 209
1497
+ },
1498
+ {
1499
+ "epoch": 0.28,
1500
+ "grad_norm": 0.16929508986851496,
1501
+ "learning_rate": 2.8916840557709474e-05,
1502
+ "loss": 2.21,
1503
+ "step": 210
1504
+ },
1505
+ {
1506
+ "epoch": 0.2813333333333333,
1507
+ "grad_norm": 0.30196081586191537,
1508
+ "learning_rate": 2.8905565884989304e-05,
1509
+ "loss": 2.2047,
1510
+ "step": 211
1511
+ },
1512
+ {
1513
+ "epoch": 0.2826666666666667,
1514
+ "grad_norm": 0.22938305453707358,
1515
+ "learning_rate": 2.889423531472455e-05,
1516
+ "loss": 2.1314,
1517
+ "step": 212
1518
+ },
1519
+ {
1520
+ "epoch": 0.284,
1521
+ "grad_norm": 0.24253705098980116,
1522
+ "learning_rate": 2.8882848897968974e-05,
1523
+ "loss": 1.9166,
1524
+ "step": 213
1525
+ },
1526
+ {
1527
+ "epoch": 0.2853333333333333,
1528
+ "grad_norm": 0.21184383213391744,
1529
+ "learning_rate": 2.8871406686028006e-05,
1530
+ "loss": 2.1445,
1531
+ "step": 214
1532
+ },
1533
+ {
1534
+ "epoch": 0.2866666666666667,
1535
+ "grad_norm": 0.20065002671794552,
1536
+ "learning_rate": 2.885990873045846e-05,
1537
+ "loss": 2.2053,
1538
+ "step": 215
1539
+ },
1540
+ {
1541
+ "epoch": 0.288,
1542
+ "grad_norm": 0.20595618063455937,
1543
+ "learning_rate": 2.884835508306833e-05,
1544
+ "loss": 2.0275,
1545
+ "step": 216
1546
+ },
1547
+ {
1548
+ "epoch": 0.28933333333333333,
1549
+ "grad_norm": 0.22663216643758846,
1550
+ "learning_rate": 2.883674579591656e-05,
1551
+ "loss": 2.1399,
1552
+ "step": 217
1553
+ },
1554
+ {
1555
+ "epoch": 0.2906666666666667,
1556
+ "grad_norm": 0.20459405622917667,
1557
+ "learning_rate": 2.8825080921312775e-05,
1558
+ "loss": 2.0352,
1559
+ "step": 218
1560
+ },
1561
+ {
1562
+ "epoch": 0.292,
1563
+ "grad_norm": 0.24151264290319524,
1564
+ "learning_rate": 2.8813360511817092e-05,
1565
+ "loss": 2.0821,
1566
+ "step": 219
1567
+ },
1568
+ {
1569
+ "epoch": 0.29333333333333333,
1570
+ "grad_norm": 0.2377001051292377,
1571
+ "learning_rate": 2.8801584620239833e-05,
1572
+ "loss": 2.0448,
1573
+ "step": 220
1574
+ },
1575
+ {
1576
+ "epoch": 0.2946666666666667,
1577
+ "grad_norm": 0.3689003556922912,
1578
+ "learning_rate": 2.878975329964134e-05,
1579
+ "loss": 2.107,
1580
+ "step": 221
1581
+ },
1582
+ {
1583
+ "epoch": 0.296,
1584
+ "grad_norm": 0.19280049013174935,
1585
+ "learning_rate": 2.877786660333169e-05,
1586
+ "loss": 2.2571,
1587
+ "step": 222
1588
+ },
1589
+ {
1590
+ "epoch": 0.29733333333333334,
1591
+ "grad_norm": 0.25684356289858495,
1592
+ "learning_rate": 2.876592458487049e-05,
1593
+ "loss": 1.9243,
1594
+ "step": 223
1595
+ },
1596
+ {
1597
+ "epoch": 0.2986666666666667,
1598
+ "grad_norm": 0.2460391604188184,
1599
+ "learning_rate": 2.8753927298066608e-05,
1600
+ "loss": 2.0996,
1601
+ "step": 224
1602
+ },
1603
+ {
1604
+ "epoch": 0.3,
1605
+ "grad_norm": 0.2287761563539284,
1606
+ "learning_rate": 2.8741874796977947e-05,
1607
+ "loss": 1.9114,
1608
+ "step": 225
1609
+ },
1610
+ {
1611
+ "epoch": 0.3,
1612
+ "eval_loss": 1.7546015977859497,
1613
+ "eval_runtime": 98.7181,
1614
+ "eval_samples_per_second": 1.013,
1615
+ "eval_steps_per_second": 0.253,
1616
+ "step": 225
1617
+ },
1618
+ {
+ "epoch": 0.30133333333333334,
+ "grad_norm": 0.31900423121867244,
+ "learning_rate": 2.8729767135911197e-05,
+ "loss": 1.5513,
+ "step": 226
+ },
+ {
+ "epoch": 0.30266666666666664,
+ "grad_norm": 0.21099907718331903,
+ "learning_rate": 2.8717604369421587e-05,
+ "loss": 1.8735,
+ "step": 227
+ },
+ {
+ "epoch": 0.304,
+ "grad_norm": 0.23179334925928516,
+ "learning_rate": 2.8705386552312647e-05,
+ "loss": 2.0279,
+ "step": 228
+ },
+ {
+ "epoch": 0.30533333333333335,
+ "grad_norm": 0.2249602605454057,
+ "learning_rate": 2.869311373963596e-05,
+ "loss": 2.1506,
+ "step": 229
+ },
+ {
+ "epoch": 0.30666666666666664,
+ "grad_norm": 0.30614551906788895,
+ "learning_rate": 2.8680785986690903e-05,
+ "loss": 2.0057,
+ "step": 230
+ },
+ {
+ "epoch": 0.308,
+ "grad_norm": 0.32936171428483624,
+ "learning_rate": 2.86684033490244e-05,
+ "loss": 1.9719,
+ "step": 231
+ },
+ {
+ "epoch": 0.30933333333333335,
+ "grad_norm": 0.21095264296363714,
+ "learning_rate": 2.8655965882430697e-05,
+ "loss": 2.0526,
+ "step": 232
+ },
+ {
+ "epoch": 0.31066666666666665,
+ "grad_norm": 0.2506508450731689,
+ "learning_rate": 2.8643473642951066e-05,
+ "loss": 2.034,
+ "step": 233
+ },
+ {
+ "epoch": 0.312,
+ "grad_norm": 0.16272890488221511,
+ "learning_rate": 2.8630926686873598e-05,
+ "loss": 2.2394,
+ "step": 234
+ },
+ {
+ "epoch": 0.31333333333333335,
+ "grad_norm": 0.2636216414291072,
+ "learning_rate": 2.8618325070732918e-05,
+ "loss": 1.9559,
+ "step": 235
+ },
+ {
+ "epoch": 0.31466666666666665,
+ "grad_norm": 0.239263175250845,
+ "learning_rate": 2.860566885130994e-05,
+ "loss": 1.9268,
+ "step": 236
+ },
+ {
+ "epoch": 0.316,
+ "grad_norm": 0.3034572733475333,
+ "learning_rate": 2.8592958085631616e-05,
+ "loss": 2.4146,
+ "step": 237
+ },
+ {
+ "epoch": 0.31733333333333336,
+ "grad_norm": 0.22016199356343413,
+ "learning_rate": 2.8580192830970674e-05,
+ "loss": 1.901,
+ "step": 238
+ },
+ {
+ "epoch": 0.31866666666666665,
+ "grad_norm": 0.23102023640647829,
+ "learning_rate": 2.856737314484536e-05,
+ "loss": 2.1498,
+ "step": 239
+ },
+ {
+ "epoch": 0.32,
+ "grad_norm": 0.20238380453993177,
+ "learning_rate": 2.8554499085019177e-05,
+ "loss": 2.0954,
+ "step": 240
+ },
+ {
+ "epoch": 0.32133333333333336,
+ "grad_norm": 0.20808439458283837,
+ "learning_rate": 2.854157070950063e-05,
+ "loss": 2.0783,
+ "step": 241
+ },
+ {
+ "epoch": 0.32266666666666666,
+ "grad_norm": 0.278146848673199,
+ "learning_rate": 2.8528588076542966e-05,
+ "loss": 1.7518,
+ "step": 242
+ },
+ {
+ "epoch": 0.324,
+ "grad_norm": 0.20915907784327617,
+ "learning_rate": 2.8515551244643903e-05,
+ "loss": 1.7229,
+ "step": 243
+ },
+ {
+ "epoch": 0.3253333333333333,
+ "grad_norm": 0.5420477506537028,
+ "learning_rate": 2.850246027254537e-05,
+ "loss": 1.758,
+ "step": 244
+ },
+ {
+ "epoch": 0.32666666666666666,
+ "grad_norm": 0.2800677412041746,
+ "learning_rate": 2.8489315219233248e-05,
+ "loss": 1.9584,
+ "step": 245
+ },
+ {
+ "epoch": 0.328,
+ "grad_norm": 0.22569958386724373,
+ "learning_rate": 2.847611614393709e-05,
+ "loss": 2.1033,
+ "step": 246
+ },
+ {
+ "epoch": 0.3293333333333333,
+ "grad_norm": 0.28225543933402175,
+ "learning_rate": 2.846286310612988e-05,
+ "loss": 2.2579,
+ "step": 247
+ },
+ {
+ "epoch": 0.33066666666666666,
+ "grad_norm": 0.25637825153268795,
+ "learning_rate": 2.844955616552773e-05,
+ "loss": 1.9495,
+ "step": 248
+ },
+ {
+ "epoch": 0.332,
+ "grad_norm": 0.3042322338758206,
+ "learning_rate": 2.8436195382089644e-05,
+ "loss": 2.2247,
+ "step": 249
+ },
+ {
+ "epoch": 0.3333333333333333,
+ "grad_norm": 0.2110291139150889,
+ "learning_rate": 2.8422780816017227e-05,
+ "loss": 2.2582,
+ "step": 250
+ },
+ {
+ "epoch": 0.33466666666666667,
+ "grad_norm": 0.24476057607988286,
+ "learning_rate": 2.8409312527754417e-05,
+ "loss": 2.1626,
+ "step": 251
+ },
+ {
+ "epoch": 0.336,
+ "grad_norm": 0.19859582662498323,
+ "learning_rate": 2.8395790577987225e-05,
+ "loss": 2.0592,
+ "step": 252
+ },
+ {
+ "epoch": 0.3373333333333333,
+ "grad_norm": 0.2560077569832347,
+ "learning_rate": 2.8382215027643447e-05,
+ "loss": 1.9116,
+ "step": 253
+ },
+ {
+ "epoch": 0.33866666666666667,
+ "grad_norm": 0.2325816082873428,
+ "learning_rate": 2.836858593789239e-05,
+ "loss": 2.1088,
+ "step": 254
+ },
+ {
+ "epoch": 0.34,
+ "grad_norm": 0.2633745282570676,
+ "learning_rate": 2.8354903370144613e-05,
+ "loss": 2.3106,
+ "step": 255
+ },
+ {
+ "epoch": 0.3413333333333333,
+ "grad_norm": 0.25738882949934827,
+ "learning_rate": 2.834116738605162e-05,
+ "loss": 1.9449,
+ "step": 256
+ },
+ {
+ "epoch": 0.3426666666666667,
+ "grad_norm": 0.19022713562039067,
+ "learning_rate": 2.8327378047505625e-05,
+ "loss": 2.0301,
+ "step": 257
+ },
+ {
+ "epoch": 0.344,
+ "grad_norm": 0.21459535231309906,
+ "learning_rate": 2.8313535416639232e-05,
+ "loss": 2.3784,
+ "step": 258
+ },
+ {
+ "epoch": 0.3453333333333333,
+ "grad_norm": 0.2725347155581082,
+ "learning_rate": 2.829963955582518e-05,
+ "loss": 1.9749,
+ "step": 259
+ },
+ {
+ "epoch": 0.3466666666666667,
+ "grad_norm": 0.23824574091343387,
+ "learning_rate": 2.828569052767604e-05,
+ "loss": 2.1253,
+ "step": 260
+ },
+ {
+ "epoch": 0.348,
+ "grad_norm": 0.30927788905161807,
+ "learning_rate": 2.8271688395043965e-05,
+ "loss": 1.6379,
+ "step": 261
+ },
+ {
+ "epoch": 0.34933333333333333,
+ "grad_norm": 0.13916235450047149,
+ "learning_rate": 2.8257633221020382e-05,
+ "loss": 2.0781,
+ "step": 262
+ },
+ {
+ "epoch": 0.3506666666666667,
+ "grad_norm": 0.2038351351610376,
+ "learning_rate": 2.8243525068935705e-05,
+ "loss": 2.1167,
+ "step": 263
+ },
+ {
+ "epoch": 0.352,
+ "grad_norm": 0.2282060698474947,
+ "learning_rate": 2.8229364002359074e-05,
+ "loss": 1.9761,
+ "step": 264
+ },
+ {
+ "epoch": 0.35333333333333333,
+ "grad_norm": 0.19533951565078514,
+ "learning_rate": 2.821515008509804e-05,
+ "loss": 2.3902,
+ "step": 265
+ },
+ {
+ "epoch": 0.3546666666666667,
+ "grad_norm": 0.22231788208405978,
+ "learning_rate": 2.8200883381198297e-05,
+ "loss": 2.2763,
+ "step": 266
+ },
+ {
+ "epoch": 0.356,
+ "grad_norm": 0.17343255119399525,
+ "learning_rate": 2.818656395494339e-05,
+ "loss": 2.1964,
+ "step": 267
+ },
+ {
+ "epoch": 0.35733333333333334,
+ "grad_norm": 0.3190889201723669,
+ "learning_rate": 2.817219187085442e-05,
+ "loss": 1.8898,
+ "step": 268
+ },
+ {
+ "epoch": 0.3586666666666667,
+ "grad_norm": 0.25221058014306336,
+ "learning_rate": 2.8157767193689753e-05,
+ "loss": 2.2137,
+ "step": 269
+ },
+ {
+ "epoch": 0.36,
+ "grad_norm": 0.25456852526288787,
+ "learning_rate": 2.8143289988444737e-05,
+ "loss": 2.1055,
+ "step": 270
+ },
+ {
+ "epoch": 0.36133333333333334,
+ "grad_norm": 0.4387988494124866,
+ "learning_rate": 2.8128760320351403e-05,
+ "loss": 2.0767,
+ "step": 271
+ },
+ {
+ "epoch": 0.3626666666666667,
+ "grad_norm": 0.27199878538165156,
+ "learning_rate": 2.8114178254878156e-05,
+ "loss": 2.1353,
+ "step": 272
+ },
+ {
+ "epoch": 0.364,
+ "grad_norm": 0.2650737962386425,
+ "learning_rate": 2.8099543857729525e-05,
+ "loss": 2.037,
+ "step": 273
+ },
+ {
+ "epoch": 0.36533333333333334,
+ "grad_norm": 0.19178775730316125,
+ "learning_rate": 2.808485719484581e-05,
+ "loss": 2.2227,
+ "step": 274
+ },
+ {
+ "epoch": 0.36666666666666664,
+ "grad_norm": 0.2495538747593823,
+ "learning_rate": 2.8070118332402827e-05,
+ "loss": 2.1111,
+ "step": 275
+ },
+ {
+ "epoch": 0.368,
+ "grad_norm": 0.23278524231204606,
+ "learning_rate": 2.8055327336811585e-05,
+ "loss": 2.0269,
+ "step": 276
+ },
+ {
+ "epoch": 0.36933333333333335,
+ "grad_norm": 0.2909743741626987,
+ "learning_rate": 2.804048427471801e-05,
+ "loss": 1.9845,
+ "step": 277
+ },
+ {
+ "epoch": 0.37066666666666664,
+ "grad_norm": 0.5654501769424853,
+ "learning_rate": 2.8025589213002624e-05,
+ "loss": 1.9375,
+ "step": 278
+ },
+ {
+ "epoch": 0.372,
+ "grad_norm": 0.24272194924805965,
+ "learning_rate": 2.8010642218780246e-05,
+ "loss": 1.9151,
+ "step": 279
+ },
+ {
+ "epoch": 0.37333333333333335,
+ "grad_norm": 0.23044549921686833,
+ "learning_rate": 2.7995643359399703e-05,
+ "loss": 2.1577,
+ "step": 280
+ },
+ {
+ "epoch": 0.37466666666666665,
+ "grad_norm": 0.2297518184027699,
+ "learning_rate": 2.7980592702443518e-05,
+ "loss": 2.2779,
+ "step": 281
+ },
+ {
+ "epoch": 0.376,
+ "grad_norm": 0.43013668653840365,
+ "learning_rate": 2.79654903157276e-05,
+ "loss": 1.9419,
+ "step": 282
+ },
+ {
+ "epoch": 0.37733333333333335,
+ "grad_norm": 0.23835905215528563,
+ "learning_rate": 2.795033626730095e-05,
+ "loss": 2.1743,
+ "step": 283
+ },
+ {
+ "epoch": 0.37866666666666665,
+ "grad_norm": 0.2813058642182358,
+ "learning_rate": 2.793513062544534e-05,
+ "loss": 2.0095,
+ "step": 284
+ },
+ {
+ "epoch": 0.38,
+ "grad_norm": 0.34406979517562547,
+ "learning_rate": 2.7919873458675022e-05,
+ "loss": 1.9489,
+ "step": 285
+ },
+ {
+ "epoch": 0.38133333333333336,
+ "grad_norm": 0.32231785755728004,
+ "learning_rate": 2.790456483573642e-05,
+ "loss": 1.58,
+ "step": 286
+ },
+ {
+ "epoch": 0.38266666666666665,
+ "grad_norm": 0.19333385414604146,
+ "learning_rate": 2.788920482560779e-05,
+ "loss": 1.9752,
+ "step": 287
+ },
+ {
+ "epoch": 0.384,
+ "grad_norm": 0.1765295794157884,
+ "learning_rate": 2.7873793497498945e-05,
+ "loss": 2.0614,
+ "step": 288
+ },
+ {
+ "epoch": 0.38533333333333336,
+ "grad_norm": 0.16114993710297087,
+ "learning_rate": 2.7858330920850923e-05,
+ "loss": 2.2809,
+ "step": 289
+ },
+ {
+ "epoch": 0.38666666666666666,
+ "grad_norm": 0.23825781911565472,
+ "learning_rate": 2.784281716533568e-05,
+ "loss": 2.1522,
+ "step": 290
+ },
+ {
+ "epoch": 0.388,
+ "grad_norm": 0.17001985217323573,
+ "learning_rate": 2.782725230085579e-05,
+ "loss": 1.9506,
+ "step": 291
+ },
+ {
+ "epoch": 0.3893333333333333,
+ "grad_norm": 0.26309455211904453,
+ "learning_rate": 2.7811636397544094e-05,
+ "loss": 2.2083,
+ "step": 292
+ },
+ {
+ "epoch": 0.39066666666666666,
+ "grad_norm": 0.20730249929125907,
+ "learning_rate": 2.7795969525763418e-05,
+ "loss": 2.2347,
+ "step": 293
+ },
+ {
+ "epoch": 0.392,
+ "grad_norm": 0.24136453807928746,
+ "learning_rate": 2.7780251756106242e-05,
+ "loss": 2.029,
+ "step": 294
+ },
+ {
+ "epoch": 0.3933333333333333,
+ "grad_norm": 0.35339039676185235,
+ "learning_rate": 2.7764483159394384e-05,
+ "loss": 2.211,
+ "step": 295
+ },
+ {
+ "epoch": 0.39466666666666667,
+ "grad_norm": 0.25636488066657487,
+ "learning_rate": 2.7748663806678684e-05,
+ "loss": 2.1863,
+ "step": 296
+ },
+ {
+ "epoch": 0.396,
+ "grad_norm": 0.20584179603071404,
+ "learning_rate": 2.7732793769238674e-05,
+ "loss": 2.2463,
+ "step": 297
+ },
+ {
+ "epoch": 0.3973333333333333,
+ "grad_norm": 0.23027068378292892,
+ "learning_rate": 2.7716873118582266e-05,
+ "loss": 2.2356,
+ "step": 298
+ },
+ {
+ "epoch": 0.39866666666666667,
+ "grad_norm": 0.192849837559735,
+ "learning_rate": 2.770090192644543e-05,
+ "loss": 2.2441,
+ "step": 299
+ },
+ {
+ "epoch": 0.4,
+ "grad_norm": 0.2685272097028798,
+ "learning_rate": 2.768488026479187e-05,
+ "loss": 2.0004,
+ "step": 300
+ },
+ {
+ "epoch": 0.4,
+ "eval_loss": 1.7474404573440552,
+ "eval_runtime": 98.8253,
+ "eval_samples_per_second": 1.012,
+ "eval_steps_per_second": 0.253,
+ "step": 300
+ },
+ {
+ "epoch": 0.4013333333333333,
+ "grad_norm": 0.26520047905722166,
+ "learning_rate": 2.766880820581269e-05,
+ "loss": 2.0768,
+ "step": 301
+ },
+ {
+ "epoch": 0.4026666666666667,
+ "grad_norm": 0.31076708165257216,
+ "learning_rate": 2.765268582192608e-05,
+ "loss": 2.0636,
+ "step": 302
+ },
+ {
+ "epoch": 0.404,
+ "grad_norm": 0.25216548132506456,
+ "learning_rate": 2.763651318577699e-05,
+ "loss": 2.0432,
+ "step": 303
+ },
+ {
+ "epoch": 0.4053333333333333,
+ "grad_norm": 0.2393514712031434,
+ "learning_rate": 2.7620290370236786e-05,
+ "loss": 2.3461,
+ "step": 304
+ },
+ {
+ "epoch": 0.4066666666666667,
+ "grad_norm": 0.21599798017132937,
+ "learning_rate": 2.7604017448402954e-05,
+ "loss": 2.3087,
+ "step": 305
+ },
+ {
+ "epoch": 0.408,
+ "grad_norm": 0.3082071954730338,
+ "learning_rate": 2.7587694493598743e-05,
+ "loss": 1.9302,
+ "step": 306
+ },
+ {
+ "epoch": 0.4093333333333333,
+ "grad_norm": 0.21921446809175332,
+ "learning_rate": 2.7571321579372835e-05,
+ "loss": 2.1354,
+ "step": 307
+ },
+ {
+ "epoch": 0.4106666666666667,
+ "grad_norm": 0.2141708374659774,
+ "learning_rate": 2.7554898779499025e-05,
+ "loss": 2.0915,
+ "step": 308
+ },
+ {
+ "epoch": 0.412,
+ "grad_norm": 0.2615730026982423,
+ "learning_rate": 2.7538426167975895e-05,
+ "loss": 1.9005,
+ "step": 309
+ },
+ {
+ "epoch": 0.41333333333333333,
+ "grad_norm": 0.24746987092185013,
+ "learning_rate": 2.7521903819026457e-05,
+ "loss": 2.129,
+ "step": 310
+ },
+ {
+ "epoch": 0.4146666666666667,
+ "grad_norm": 0.23072670442926277,
+ "learning_rate": 2.7505331807097845e-05,
+ "loss": 2.1061,
+ "step": 311
+ },
+ {
+ "epoch": 0.416,
+ "grad_norm": 3.0090982638780748,
+ "learning_rate": 2.7488710206860944e-05,
+ "loss": 2.1382,
+ "step": 312
+ },
+ {
+ "epoch": 0.41733333333333333,
+ "grad_norm": 0.2705418179646343,
+ "learning_rate": 2.7472039093210108e-05,
+ "loss": 2.2219,
+ "step": 313
+ },
+ {
+ "epoch": 0.4186666666666667,
+ "grad_norm": 0.3567890646341258,
+ "learning_rate": 2.7455318541262768e-05,
+ "loss": 1.6885,
+ "step": 314
+ },
+ {
+ "epoch": 0.42,
+ "grad_norm": 0.20851488835910675,
+ "learning_rate": 2.743854862635912e-05,
+ "loss": 1.947,
+ "step": 315
+ },
+ {
+ "epoch": 0.42133333333333334,
+ "grad_norm": 0.24844212488624198,
+ "learning_rate": 2.742172942406179e-05,
+ "loss": 2.1564,
+ "step": 316
+ },
+ {
+ "epoch": 0.4226666666666667,
+ "grad_norm": 0.304277870351501,
+ "learning_rate": 2.7404861010155477e-05,
+ "loss": 1.8723,
+ "step": 317
+ },
+ {
+ "epoch": 0.424,
+ "grad_norm": 0.20754447108795832,
+ "learning_rate": 2.7387943460646624e-05,
+ "loss": 2.4758,
+ "step": 318
+ },
+ {
+ "epoch": 0.42533333333333334,
+ "grad_norm": 0.24883308801860618,
+ "learning_rate": 2.7370976851763068e-05,
+ "loss": 1.6462,
+ "step": 319
+ },
+ {
+ "epoch": 0.4266666666666667,
+ "grad_norm": 0.23024352142002386,
+ "learning_rate": 2.73539612599537e-05,
+ "loss": 2.1038,
+ "step": 320
+ },
+ {
+ "epoch": 0.428,
+ "grad_norm": 0.3139739867715439,
+ "learning_rate": 2.7336896761888126e-05,
+ "loss": 1.902,
+ "step": 321
+ },
+ {
+ "epoch": 0.42933333333333334,
+ "grad_norm": 0.21999868040548332,
+ "learning_rate": 2.7319783434456306e-05,
+ "loss": 2.171,
+ "step": 322
+ },
+ {
+ "epoch": 0.43066666666666664,
+ "grad_norm": 0.2467001853935637,
+ "learning_rate": 2.7302621354768233e-05,
+ "loss": 1.9828,
+ "step": 323
+ },
+ {
+ "epoch": 0.432,
+ "grad_norm": 0.2496730220022523,
+ "learning_rate": 2.728541060015356e-05,
+ "loss": 2.3413,
+ "step": 324
+ },
+ {
+ "epoch": 0.43333333333333335,
+ "grad_norm": 0.4056202922850381,
+ "learning_rate": 2.7268151248161253e-05,
+ "loss": 2.0407,
+ "step": 325
+ },
+ {
+ "epoch": 0.43466666666666665,
+ "grad_norm": 0.2171839709482431,
+ "learning_rate": 2.7250843376559265e-05,
+ "loss": 2.0089,
+ "step": 326
+ },
+ {
+ "epoch": 0.436,
+ "grad_norm": 0.26937671179427725,
+ "learning_rate": 2.7233487063334172e-05,
+ "loss": 2.2827,
+ "step": 327
+ },
+ {
+ "epoch": 0.43733333333333335,
+ "grad_norm": 0.20567971005307945,
+ "learning_rate": 2.7216082386690804e-05,
+ "loss": 2.2386,
+ "step": 328
+ },
+ {
+ "epoch": 0.43866666666666665,
+ "grad_norm": 0.2027173811893726,
+ "learning_rate": 2.7198629425051917e-05,
+ "loss": 2.3031,
+ "step": 329
+ },
+ {
+ "epoch": 0.44,
+ "grad_norm": 0.2980323024473477,
+ "learning_rate": 2.7181128257057846e-05,
+ "loss": 2.1883,
+ "step": 330
+ },
+ {
+ "epoch": 0.44133333333333336,
+ "grad_norm": 0.20192851290457803,
+ "learning_rate": 2.716357896156611e-05,
+ "loss": 1.9459,
+ "step": 331
+ },
+ {
+ "epoch": 0.44266666666666665,
+ "grad_norm": 0.2871519578556366,
+ "learning_rate": 2.7145981617651108e-05,
+ "loss": 2.1742,
+ "step": 332
+ },
+ {
+ "epoch": 0.444,
+ "grad_norm": 0.1949811627760568,
+ "learning_rate": 2.712833630460372e-05,
+ "loss": 2.2647,
+ "step": 333
+ },
+ {
+ "epoch": 0.44533333333333336,
+ "grad_norm": 0.2741619826374072,
+ "learning_rate": 2.7110643101930978e-05,
+ "loss": 1.9767,
+ "step": 334
+ },
+ {
+ "epoch": 0.44666666666666666,
+ "grad_norm": 0.22305808717890685,
+ "learning_rate": 2.7092902089355693e-05,
+ "loss": 1.9271,
+ "step": 335
+ },
+ {
+ "epoch": 0.448,
+ "grad_norm": 0.31086274795948865,
+ "learning_rate": 2.7075113346816092e-05,
+ "loss": 1.9092,
+ "step": 336
+ },
+ {
+ "epoch": 0.4493333333333333,
+ "grad_norm": 0.34135122978923976,
+ "learning_rate": 2.7057276954465484e-05,
+ "loss": 1.9028,
+ "step": 337
+ },
+ {
+ "epoch": 0.45066666666666666,
+ "grad_norm": 0.2422925649756551,
+ "learning_rate": 2.703939299267186e-05,
+ "loss": 2.3585,
+ "step": 338
+ },
+ {
+ "epoch": 0.452,
+ "grad_norm": 0.21763822152397602,
+ "learning_rate": 2.702146154201757e-05,
+ "loss": 1.9407,
+ "step": 339
+ },
+ {
+ "epoch": 0.4533333333333333,
+ "grad_norm": 0.26419913958274843,
+ "learning_rate": 2.7003482683298935e-05,
+ "loss": 2.0248,
+ "step": 340
+ },
+ {
+ "epoch": 0.45466666666666666,
+ "grad_norm": 0.2463717390301776,
+ "learning_rate": 2.698545649752588e-05,
+ "loss": 2.0147,
+ "step": 341
+ },
+ {
+ "epoch": 0.456,
+ "grad_norm": 0.21124149059138703,
+ "learning_rate": 2.696738306592159e-05,
+ "loss": 1.9508,
+ "step": 342
+ },
+ {
+ "epoch": 0.4573333333333333,
+ "grad_norm": 0.2758662340977994,
+ "learning_rate": 2.694926246992213e-05,
+ "loss": 2.1917,
+ "step": 343
+ },
+ {
+ "epoch": 0.45866666666666667,
+ "grad_norm": 0.30773115319803324,
+ "learning_rate": 2.693109479117608e-05,
+ "loss": 2.0674,
+ "step": 344
+ },
+ {
+ "epoch": 0.46,
+ "grad_norm": 0.23761648514026573,
+ "learning_rate": 2.6912880111544163e-05,
+ "loss": 2.0919,
+ "step": 345
+ },
+ {
+ "epoch": 0.4613333333333333,
+ "grad_norm": 0.25681238620857727,
+ "learning_rate": 2.6894618513098882e-05,
+ "loss": 1.9084,
+ "step": 346
+ },
+ {
+ "epoch": 0.46266666666666667,
+ "grad_norm": 0.22830005506281845,
+ "learning_rate": 2.687631007812415e-05,
+ "loss": 2.056,
+ "step": 347
+ },
+ {
+ "epoch": 0.464,
+ "grad_norm": 0.2947866578986673,
+ "learning_rate": 2.6857954889114923e-05,
+ "loss": 1.7343,
+ "step": 348
+ },
+ {
+ "epoch": 0.4653333333333333,
+ "grad_norm": 0.24624687251309207,
+ "learning_rate": 2.6839553028776817e-05,
+ "loss": 2.2644,
+ "step": 349
+ },
+ {
+ "epoch": 0.4666666666666667,
+ "grad_norm": 0.23775099815354966,
+ "learning_rate": 2.682110458002575e-05,
+ "loss": 2.0576,
+ "step": 350
+ },
+ {
+ "epoch": 0.468,
+ "grad_norm": 0.24126580378308465,
+ "learning_rate": 2.6802609625987548e-05,
+ "loss": 2.1331,
+ "step": 351
+ },
+ {
+ "epoch": 0.4693333333333333,
+ "grad_norm": 0.26152377053961967,
+ "learning_rate": 2.6784068249997586e-05,
+ "loss": 2.194,
+ "step": 352
+ },
+ {
+ "epoch": 0.4706666666666667,
+ "grad_norm": 0.2670202218764803,
+ "learning_rate": 2.676548053560042e-05,
+ "loss": 1.885,
+ "step": 353
+ },
+ {
+ "epoch": 0.472,
+ "grad_norm": 0.2227627113395291,
+ "learning_rate": 2.674684656654938e-05,
+ "loss": 2.2919,
+ "step": 354
+ },
+ {
+ "epoch": 0.47333333333333333,
+ "grad_norm": 0.5386305755161744,
+ "learning_rate": 2.6728166426806237e-05,
+ "loss": 2.1106,
+ "step": 355
+ },
+ {
+ "epoch": 0.4746666666666667,
+ "grad_norm": 0.2713934931123545,
+ "learning_rate": 2.6709440200540778e-05,
+ "loss": 1.9203,
+ "step": 356
+ },
+ {
+ "epoch": 0.476,
+ "grad_norm": 0.2687463890320937,
+ "learning_rate": 2.669066797213046e-05,
+ "loss": 2.0278,
+ "step": 357
+ },
+ {
+ "epoch": 0.47733333333333333,
+ "grad_norm": 0.2420523080214261,
+ "learning_rate": 2.6671849826160018e-05,
+ "loss": 1.9393,
+ "step": 358
+ },
+ {
+ "epoch": 0.4786666666666667,
+ "grad_norm": 0.2400087971667011,
+ "learning_rate": 2.6652985847421074e-05,
+ "loss": 2.1867,
+ "step": 359
+ },
+ {
+ "epoch": 0.48,
+ "grad_norm": 0.28320805189191467,
+ "learning_rate": 2.663407612091178e-05,
+ "loss": 1.8562,
+ "step": 360
+ },
+ {
+ "epoch": 0.48133333333333334,
+ "grad_norm": 0.20819758118379753,
+ "learning_rate": 2.6615120731836412e-05,
+ "loss": 2.0647,
+ "step": 361
+ },
+ {
+ "epoch": 0.4826666666666667,
+ "grad_norm": 0.19756628127478343,
+ "learning_rate": 2.6596119765604996e-05,
+ "loss": 2.1335,
+ "step": 362
+ },
+ {
+ "epoch": 0.484,
+ "grad_norm": 0.26928306040248223,
+ "learning_rate": 2.6577073307832925e-05,
+ "loss": 2.0874,
+ "step": 363
+ },
+ {
+ "epoch": 0.48533333333333334,
+ "grad_norm": 0.2164920362315045,
+ "learning_rate": 2.655798144434056e-05,
+ "loss": 2.1714,
+ "step": 364
+ },
+ {
+ "epoch": 0.4866666666666667,
+ "grad_norm": 0.24919806333780237,
+ "learning_rate": 2.6538844261152863e-05,
+ "loss": 1.9509,
+ "step": 365
+ },
+ {
+ "epoch": 0.488,
+ "grad_norm": 0.27032556044133776,
+ "learning_rate": 2.6519661844498997e-05,
+ "loss": 1.801,
+ "step": 366
+ },
+ {
+ "epoch": 0.48933333333333334,
+ "grad_norm": 0.26804357878114354,
+ "learning_rate": 2.650043428081194e-05,
+ "loss": 2.1762,
+ "step": 367
+ },
+ {
+ "epoch": 0.49066666666666664,
+ "grad_norm": 0.3572135575195324,
+ "learning_rate": 2.6481161656728093e-05,
+ "loss": 1.9907,
+ "step": 368
+ },
+ {
+ "epoch": 0.492,
+ "grad_norm": 0.20724767007565056,
+ "learning_rate": 2.646184405908689e-05,
+ "loss": 2.0845,
+ "step": 369
+ },
+ {
+ "epoch": 0.49333333333333335,
+ "grad_norm": 0.29461252876218635,
+ "learning_rate": 2.6442481574930417e-05,
+ "loss": 2.0182,
+ "step": 370
+ },
+ {
+ "epoch": 0.49466666666666664,
+ "grad_norm": 0.32302958691645034,
+ "learning_rate": 2.6423074291503e-05,
+ "loss": 1.8637,
+ "step": 371
+ },
+ {
+ "epoch": 0.496,
+ "grad_norm": 0.21633487662675288,
+ "learning_rate": 2.6403622296250843e-05,
+ "loss": 2.1181,
+ "step": 372
+ },
+ {
+ "epoch": 0.49733333333333335,
+ "grad_norm": 0.23982202171517725,
+ "learning_rate": 2.6384125676821594e-05,
+ "loss": 2.2663,
+ "step": 373
+ },
+ {
+ "epoch": 0.49866666666666665,
+ "grad_norm": 0.26674040749443334,
+ "learning_rate": 2.636458452106398e-05,
+ "loss": 1.9904,
+ "step": 374
+ },
+ {
+ "epoch": 0.5,
+ "grad_norm": 0.20615794890547487,
+ "learning_rate": 2.63449989170274e-05,
+ "loss": 2.2052,
+ "step": 375
+ },
+ {
+ "epoch": 0.5,
+ "eval_loss": 1.7428412437438965,
+ "eval_runtime": 98.8155,
+ "eval_samples_per_second": 1.012,
+ "eval_steps_per_second": 0.253,
+ "step": 375
+ }
+ ],
+ "logging_steps": 1,
+ "max_steps": 1500,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 2,
+ "save_steps": 375,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": false
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 87275077632000.0,
+ "train_batch_size": 1,
+ "trial_name": null,
+ "trial_params": null
+ }
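
The trainer state above is plain JSON, so the logged curves can be inspected without touching the model weights. A minimal sketch (assuming the checkpoint has been downloaded and the file sits at `checkpoint-375/trainer_state.json`):

```python
import json

# Load the trainer state saved alongside the checkpoint.
with open("checkpoint-375/trainer_state.json") as f:
    state = json.load(f)

# Per-step training entries carry "loss"; periodic eval entries carry "eval_loss".
train_logs = [e for e in state["log_history"] if "loss" in e]
eval_logs = [e for e in state["log_history"] if "eval_loss" in e]

print(f"last train loss @ step {train_logs[-1]['step']}: {train_logs[-1]['loss']}")
for e in eval_logs:
    print(f"step {e['step']}: eval_loss={e['eval_loss']}")
```

At this checkpoint the eval loss is still improving, if slowly: 1.7546 at step 225, 1.7474 at step 300, 1.7428 at step 375.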
checkpoint-375/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6a971df54a1fea7e372afc85de88dc0b114ebd6f71ee0a1f0c747c7a6c10a7c8
+ size 8568
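
Only the Git LFS pointer appears in the diff; the file itself is the pickled `TrainingArguments` object that the Hugging Face `Trainer` saves as `training_args.bin`. A hedged sketch of inspecting it (assumes `transformers` and a compatible `torch` are installed, since unpickling imports their classes):

```python
import torch

# weights_only=False is required on newer torch versions, because this file
# contains a pickled TrainingArguments object rather than plain tensors.
args = torch.load("checkpoint-375/training_args.bin", weights_only=False)
print(args.learning_rate, args.per_device_train_batch_size, args.max_steps)
```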
checkpoint-375/zero_to_fp32.py ADDED
@@ -0,0 +1,760 @@
+ #!/usr/bin/env python
+
+ # Copyright (c) Microsoft Corporation.
+ # SPDX-License-Identifier: Apache-2.0
+
+ # DeepSpeed Team
+
+ # This script extracts fp32 consolidated weights from ZeRO 1, 2 and 3 DeepSpeed checkpoints. It gets
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
+ # application.
+ #
+ # example:
+ #   python zero_to_fp32.py . output_dir/
+ # or
+ #   python zero_to_fp32.py . output_dir/ --safe_serialization
+
+ import argparse
+ import torch
+ import glob
+ import math
+ import os
+ import re
+ import gc
+ import json
+ import numpy as np
+ from tqdm import tqdm
+ from collections import OrderedDict
+ from dataclasses import dataclass
+
+ # While this script doesn't use deepspeed to recover data, the checkpoints are pickled with
+ # DeepSpeed data structures, so deepspeed has to be available in the current python environment.
+ from deepspeed.utils import logger
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
+                                             FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
+                                             FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
+
+
+ @dataclass
+ class zero_model_state:
+     buffers: dict
+     param_shapes: dict
+     shared_params: list
+     ds_version: int
+     frozen_param_shapes: dict
+     frozen_param_fragments: dict
+
+
+ debug = 0
+
+ # load to cpu
+ device = torch.device('cpu')
+
+
+ def atoi(text):
+     return int(text) if text.isdigit() else text
+
+
+ def natural_keys(text):
+     '''
+     alist.sort(key=natural_keys) sorts in human order
+     http://nedbatchelder.com/blog/200712/human_sorting.html
+     (See Toothy's implementation in the comments)
+     '''
+     return [atoi(c) for c in re.split(r'(\d+)', text)]
+
+
+ def get_model_state_file(checkpoint_dir, zero_stage):
+     if not os.path.isdir(checkpoint_dir):
+         raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
+
+     # there should be only one file
+     if zero_stage <= 2:
+         file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
+     elif zero_stage == 3:
+         file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
+
+     if not os.path.exists(file):
+         raise FileNotFoundError(f"can't find model states file at '{file}'")
+
+     return file
+
+
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
+     # XXX: need to test that this simple glob rule works for multi-node setup too
+     ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
+
+     if len(ckpt_files) == 0:
+         raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
+
+     return ckpt_files
+
+
+ def get_optim_files(checkpoint_dir):
+     return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
+
+
+ def get_model_state_files(checkpoint_dir):
+     return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
+
+
+ def parse_model_states(files):
+     zero_model_states = []
+     for file in files:
+         state_dict = torch.load(file, map_location=device, weights_only=False)
+
+         if BUFFER_NAMES not in state_dict:
+             raise ValueError(f"{file} is not a model state checkpoint")
+         buffer_names = state_dict[BUFFER_NAMES]
+         if debug:
+             print("Found buffers:", buffer_names)
+
+         # recover just the buffers while restoring them to fp32 if they were saved in fp16
+         buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
+         param_shapes = state_dict[PARAM_SHAPES]
+
+         # collect parameters that are included in param_shapes
+         param_names = []
+         for s in param_shapes:
+             for name in s.keys():
+                 param_names.append(name)
+
+         # update with frozen parameters
+         frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
+         if frozen_param_shapes is not None:
+             if debug:
+                 print(f"Found frozen_param_shapes: {frozen_param_shapes}")
+             param_names += list(frozen_param_shapes.keys())
+
+         # handle shared params
+         shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
+
+         ds_version = state_dict.get(DS_VERSION, None)
+
+         frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
+
+         z_model_state = zero_model_state(buffers=buffers,
+                                          param_shapes=param_shapes,
+                                          shared_params=shared_params,
+                                          ds_version=ds_version,
+                                          frozen_param_shapes=frozen_param_shapes,
+                                          frozen_param_fragments=frozen_param_fragments)
+         zero_model_states.append(z_model_state)
+
+     return zero_model_states
+
+
+ def parse_optim_states(files, ds_checkpoint_dir):
+     total_files = len(files)
+     state_dicts = []
+     for f in tqdm(files, desc='Loading checkpoint shards'):
+         state_dict = torch.load(f, map_location=device, mmap=True, weights_only=False)
+         # immediately discard the potentially huge optimizer states, as we only care about the fp32 master
+         # weights, and also handle the case where they were already removed by another helper script
+         state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
+         state_dicts.append(state_dict)
+
+     if ZERO_STAGE not in state_dicts[0][OPTIMIZER_STATE_DICT]:
+         raise ValueError(f"{files[0]} is not a zero checkpoint")
+     zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
+     world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
+
+     # For ZeRO-2 each param group can have different partition_count, as data parallelism for expert
+     # parameters can be different from data parallelism for non-expert parameters. So we can just
+     # use the max of the partition_count to get the dp world_size.
+
+     if type(world_size) is list:
+         world_size = max(world_size)
+
+     if world_size != total_files:
+         raise ValueError(
+             f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
+             "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
+         )
+
+     # the groups are named differently in each stage
+     if zero_stage <= 2:
+         fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
+     elif zero_stage == 3:
+         fp32_groups_key = FP32_FLAT_GROUPS
+     else:
+         raise ValueError(f"unknown zero stage {zero_stage}")
+
+     fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
+     return zero_stage, world_size, fp32_flat_groups
+
+
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
+     """
+     Returns fp32 state_dict reconstructed from ds checkpoint
+
+     Args:
+         - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
+
+     """
+     print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
+
+     optim_files = get_optim_files(ds_checkpoint_dir)
+     zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
+     print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
+
+     model_files = get_model_state_files(ds_checkpoint_dir)
+
+     zero_model_states = parse_model_states(model_files)
+     print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
+
+     if zero_stage <= 2:
+         return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                           exclude_frozen_parameters)
+     elif zero_stage == 3:
+         return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                           exclude_frozen_parameters)
+
+
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
+     if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+         return
+
+     frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+     frozen_param_fragments = zero_model_states[0].frozen_param_fragments
+
+     if debug:
+         num_elem = sum(s.numel() for s in frozen_param_shapes.values())
+         print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+
+     wanted_params = len(frozen_param_shapes)
+     wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+     avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
+     print(f'Frozen params: Have {avail_numel} numels to process.')
+     print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+
+     total_params = 0
+     total_numel = 0
+     for name, shape in frozen_param_shapes.items():
+         total_params += 1
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+
+         state_dict[name] = frozen_param_fragments[name]
+
+         if debug:
+             print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+
+     print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _has_callable(obj, fn):
+     attr = getattr(obj, fn, None)
+     return callable(attr)
+
+
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+     param_shapes = zero_model_states[0].param_shapes
+
+     # Reconstruction protocol:
+     #
+     # XXX: document this
+
+     if debug:
+         for i in range(world_size):
+             for j in range(len(fp32_flat_groups[0])):
+                 print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
+
+     # XXX: memory usage doubles here (zero2)
+     num_param_groups = len(fp32_flat_groups[0])
+     merged_single_partition_of_fp32_groups = []
+     for i in range(num_param_groups):
+         merged_partitions = [sd[i] for sd in fp32_flat_groups]
+         full_single_fp32_vector = torch.cat(merged_partitions, 0)
+         merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
+     avail_numel = sum(
+         [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
+
+     if debug:
+         wanted_params = sum([len(shapes) for shapes in param_shapes])
+         wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
+         # not asserting if there is a mismatch due to possible padding
+         print(f"Have {avail_numel} numels to process.")
+         print(f"Need {wanted_numel} numels in {wanted_params} params.")
+
+     # params
+     # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+     # out-of-core computing solution
+     total_numel = 0
+     total_params = 0
+     for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
+         offset = 0
+         avail_numel = full_single_fp32_vector.numel()
+         for name, shape in shapes.items():
+
+             unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
+             total_numel += unpartitioned_numel
+             total_params += 1
+
+             if debug:
+                 print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+             state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
+             offset += unpartitioned_numel
+
+         # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
+         # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
+         # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
+         # live optimizer object, so we are checking that the numbers are within the right range
+         align_to = 2 * world_size
+
+         def zero2_align(x):
+             return align_to * math.ceil(x / align_to)
+
+         if debug:
+             print(f"original offset={offset}, avail_numel={avail_numel}")
+
+         offset = zero2_align(offset)
+         avail_numel = zero2_align(avail_numel)
+
+         if debug:
+             print(f"aligned offset={offset}, avail_numel={avail_numel}")
+
+         # Sanity check
+         if offset != avail_numel:
+             raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+
+     print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                exclude_frozen_parameters):
+     state_dict = OrderedDict()
+
+     # buffers
+     buffers = zero_model_states[0].buffers
+     state_dict.update(buffers)
+     if debug:
+         print(f"added {len(buffers)} buffers")
+
+     if not exclude_frozen_parameters:
+         _zero2_merge_frozen_params(state_dict, zero_model_states)
+
+     _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+
+     # recover shared parameters
+     for pair in zero_model_states[0].shared_params:
+         if pair[1] in state_dict:
+             state_dict[pair[0]] = state_dict[pair[1]]
+
+     return state_dict
+
+
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
+     remainder = unpartitioned_numel % world_size
+     padding_numel = (world_size - remainder) if remainder else 0
+     partitioned_numel = math.ceil(unpartitioned_numel / world_size)
+     return partitioned_numel, padding_numel
+
+
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
+     if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+         return
+
+     if debug:
+         for i in range(world_size):
+             num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
+             print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+
+     frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+     wanted_params = len(frozen_param_shapes)
+     wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+     avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
+     print(f'Frozen params: Have {avail_numel} numels to process.')
+     print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+
+     total_params = 0
+     total_numel = 0
+     for name, shape in zero_model_states[0].frozen_param_shapes.items():
+         total_params += 1
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+
+         param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
+         state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
+
+         partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+
+         if debug:
+             print(
+                 f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+             )
+
+     print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ class GatheredTensor:
+     """
+     A pseudo tensor that collects partitioned weights.
+     It is more memory efficient when there are multiple groups.
+     """
+
+     def __init__(self, flat_groups, flat_groups_offset, offset, partitioned_numel, shape):
+         self.flat_groups = flat_groups
+         self.flat_groups_offset = flat_groups_offset
+         self.offset = offset
+         self.partitioned_numel = partitioned_numel
+         self.shape = shape
+         self.dtype = self.flat_groups[0][0].dtype
+
+     def contiguous(self):
+         """
+         Merge partitioned weights from flat_groups into a single tensor.
+         """
+         end_idx = self.offset + self.partitioned_numel
+         world_size = len(self.flat_groups)
+         pad_flat_param_chunks = []
+
+         for rank_i in range(world_size):
+             # for each rank, we need to collect weights from related group/groups
+             flat_groups_at_rank_i = self.flat_groups[rank_i]
+             start_group_id = None
+             end_group_id = None
+             for group_id in range(len(self.flat_groups_offset)):
+                 if self.flat_groups_offset[group_id] <= self.offset < self.flat_groups_offset[group_id + 1]:
+                     start_group_id = group_id
+                 if self.flat_groups_offset[group_id] < end_idx <= self.flat_groups_offset[group_id + 1]:
+                     end_group_id = group_id
+                     break
+             # collect weights from related group/groups
+             for group_id in range(start_group_id, end_group_id + 1):
+                 flat_tensor = flat_groups_at_rank_i[group_id]
+                 start_offset = self.offset - self.flat_groups_offset[group_id]
+                 end_offset = min(end_idx, self.flat_groups_offset[group_id + 1]) - self.flat_groups_offset[group_id]
+                 pad_flat_param_chunks.append(flat_tensor[start_offset:end_offset])
+
+         # collect weights from all ranks
+         pad_flat_param = torch.cat(pad_flat_param_chunks, dim=0)
+         param = pad_flat_param[:self.shape.numel()].view(self.shape).contiguous()
+         return param
+
+
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+     param_shapes = zero_model_states[0].param_shapes
+     avail_numel = sum([flat_group.numel() for flat_group in fp32_flat_groups[0]]) * world_size
+
+     # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
+     # param, re-consolidating each param, while dealing with padding if any
+
+     # merge list of dicts, preserving order
+     param_shapes = {k: v for d in param_shapes for k, v in d.items()}
+
+     if debug:
+         for i in range(world_size):
+             print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
+
+     wanted_params = len(param_shapes)
+     wanted_numel = sum(shape.numel() for shape in param_shapes.values())
+     # not asserting if there is a mismatch due to possible padding
+     avail_numel = fp32_flat_groups[0].numel() * world_size
+     print(f"Trainable params: Have {avail_numel} numels to process.")
+     print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
+
+     # params
+     # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+     # out-of-core computing solution
+     offset = 0
+     total_numel = 0
+     total_params = 0
+     flat_groups_offset = [0] + list(np.cumsum([flat_tensor.numel() for flat_tensor in fp32_flat_groups[0]]))
+     for name, shape in tqdm(param_shapes.items(), desc='Gathering sharded weights'):
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+         total_params += 1
+         partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+
+         if debug:
+             print(
+                 f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+             )
+
+         # memory efficient tensor
+         tensor = GatheredTensor(fp32_flat_groups, flat_groups_offset, offset, partitioned_numel, shape)
+         state_dict[name] = tensor
+         offset += partitioned_numel
+
+     offset *= world_size
+
+     # Sanity check
+     if offset != avail_numel:
+         raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+
+     print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                exclude_frozen_parameters):
+     state_dict = OrderedDict()
+
+     # buffers
+     buffers = zero_model_states[0].buffers
+     state_dict.update(buffers)
+     if debug:
+         print(f"added {len(buffers)} buffers")
+
+     if not exclude_frozen_parameters:
+         _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
+
+     _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+
+     # recover shared parameters
+     for pair in zero_model_states[0].shared_params:
+         if pair[1] in state_dict:
+             state_dict[pair[0]] = state_dict[pair[1]]
+
+     return state_dict
+
+
+ def to_torch_tensor(state_dict, return_empty_tensor=False):
+     """
+     Convert state_dict of GatheredTensor to torch tensor
+     """
+     torch_state_dict = {}
+     converted_tensors = {}
+     for name, tensor in state_dict.items():
+         tensor_id = id(tensor)
+         if tensor_id in converted_tensors:  # shared tensors
+             shared_tensor = torch_state_dict[converted_tensors[tensor_id]]
+             torch_state_dict[name] = shared_tensor
+         else:
+             converted_tensors[tensor_id] = name
+             if return_empty_tensor:
+                 torch_state_dict[name] = torch.empty(tensor.shape, dtype=tensor.dtype)
+             else:
+                 torch_state_dict[name] = tensor.contiguous()
+     return torch_state_dict
+
+
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir,
+                                              tag=None,
+                                              exclude_frozen_parameters=False,
+                                              lazy_mode=False):
+     """
+     Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
+     ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
+     via a model hub.
+
+     Args:
+         - ``checkpoint_dir``: path to the desired checkpoint folder
+         - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
+         - ``exclude_frozen_parameters``: exclude frozen parameters
+         - ``lazy_mode``: get state_dict in lazy mode. It returns a dict of pseudo tensors instead of torch tensors, which is more memory efficient.
+           Convert a pseudo tensor to a torch tensor by calling ``.contiguous()``
548
+
549
+ Returns:
550
+ - pytorch ``state_dict``
551
+
552
+ A typical usage might be ::
553
+
554
+ from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
555
+ # do the training and checkpoint saving
556
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
557
+ model = model.cpu() # move to cpu
558
+ model.load_state_dict(state_dict)
559
+ # submit to model hub or save the model to share with others
560
+
561
+ In this example the ``model`` will no longer be usable in the deepspeed context of the same
562
+ application. i.e. you will need to re-initialize the deepspeed engine, since
563
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
564
+
565
+ If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
566
+
567
+ Note: the above usage may not work if your application doesn't have sufficient free CPU memory.
568
+ You may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
569
+ the checkpoint. Or you can load state_dict in lazy mode ::
570
+
571
+ from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
572
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, lazy_mode=True) # not on cpu
573
+ for name, lazy_tensor in state_dict.item():
574
+ tensor = lazy_tensor.contiguous() # to cpu
575
+ print(name, tensor)
576
+ # del tensor to release memory if it no longer in use
577
+ """
578
+ if tag is None:
579
+ latest_path = os.path.join(checkpoint_dir, 'latest')
580
+ if os.path.isfile(latest_path):
581
+ with open(latest_path, 'r') as fd:
582
+ tag = fd.read().strip()
583
+ else:
584
+ raise ValueError(f"Unable to find 'latest' file at {latest_path}")
585
+
586
+ ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
587
+
588
+ if not os.path.isdir(ds_checkpoint_dir):
589
+ raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
590
+
591
+ state_dict = _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
592
+ if lazy_mode:
593
+ return state_dict
594
+ else:
595
+ return to_torch_tensor(state_dict)
596
+
597
+
598
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir,
599
+ output_dir,
600
+ max_shard_size="5GB",
601
+ safe_serialization=False,
602
+ tag=None,
603
+ exclude_frozen_parameters=False):
604
+ """
605
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
606
+ loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
607
+
608
+ Args:
609
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
610
+ - ``output_dir``: directory to the pytorch fp32 state_dict output files
611
+ - ``max_shard_size``: the maximum size for a checkpoint before being sharded, default value is 5GB
612
+ - ``safe_serialization``: whether to save the model using `safetensors` or the traditional PyTorch way (that uses `pickle`).
613
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
614
+ - ``exclude_frozen_parameters``: exclude frozen parameters
615
+ """
616
+
617
+ # Dependency pre-check
618
+ if safe_serialization:
619
+ try:
620
+ from safetensors.torch import save_file
621
+ except ImportError:
622
+ print('If you want to use `safe_serialization`, please `pip install safetensors`')
623
+ raise
624
+ if max_shard_size is not None:
625
+ try:
626
+ from huggingface_hub import split_torch_state_dict_into_shards
627
+ except ImportError:
628
+ print('If you want to use `max_shard_size`, please `pip install huggingface_hub`')
629
+ raise
630
+
631
+ # Convert zero checkpoint to state_dict
632
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir,
633
+ tag,
634
+ exclude_frozen_parameters,
635
+ lazy_mode=True)
636
+
637
+ # Shard the model if it is too big.
638
+ weights_name = "model.safetensors" if safe_serialization else "pytorch_model.bin"
639
+ if max_shard_size is not None:
640
+ filename_pattern = weights_name.replace(".bin", "{suffix}.bin").replace(".safetensors", "{suffix}.safetensors")
641
+ # an memory-efficient approach for sharding
642
+ empty_state_dict = to_torch_tensor(state_dict, return_empty_tensor=True)
643
+ state_dict_split = split_torch_state_dict_into_shards(empty_state_dict,
644
+ filename_pattern=filename_pattern,
645
+ max_shard_size=max_shard_size)
646
+ else:
647
+ from collections import namedtuple
648
+ StateDictSplit = namedtuple("StateDictSplit", ["is_sharded", "filename_to_tensors"])
649
+ state_dict_split = StateDictSplit(is_sharded=False,
650
+ filename_to_tensors={weights_name: list(state_dict.keys())})
651
+
652
+ # Save the model by shard
653
+ os.makedirs(output_dir, exist_ok=True)
654
+ filename_to_tensors = state_dict_split.filename_to_tensors.items()
655
+ for shard_file, tensors in tqdm(filename_to_tensors, desc="Saving checkpoint shards"):
656
+ shard_state_dict = {tensor_name: state_dict[tensor_name] for tensor_name in tensors}
657
+ shard_state_dict = to_torch_tensor(shard_state_dict)
658
+ output_path = os.path.join(output_dir, shard_file)
659
+ if safe_serialization:
660
+ save_file(shard_state_dict, output_path, metadata={"format": "pt"})
661
+ else:
662
+ torch.save(shard_state_dict, output_path)
663
+ # release the memory of current shard
664
+ for tensor_name in list(shard_state_dict.keys()):
665
+ del state_dict[tensor_name]
666
+ del shard_state_dict[tensor_name]
667
+ del shard_state_dict
668
+ gc.collect()
669
+
670
+ # Save index if sharded
671
+ if state_dict_split.is_sharded:
672
+ index = {
673
+ "metadata": state_dict_split.metadata,
674
+ "weight_map": state_dict_split.tensor_to_filename,
675
+ }
676
+ save_index_file = "model.safetensors.index.json" if safe_serialization else "pytorch_model.bin.index.json"
677
+ save_index_file = os.path.join(output_dir, save_index_file)
678
+ with open(save_index_file, "w", encoding="utf-8") as f:
679
+ content = json.dumps(index, indent=2, sort_keys=True) + "\n"
680
+ f.write(content)
681
+
682
+
683
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
684
+ """
685
+ 1. Put the provided model to cpu
686
+ 2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
687
+ 3. Load it into the provided model
688
+
689
+ Args:
690
+ - ``model``: the model object to update
691
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
692
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
693
+
694
+ Returns:
695
+ - ``model`: modified model
696
+
697
+ Make sure you have plenty of CPU memory available before you call this function. If you don't
698
+ have enough use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
699
+ conveniently placed for you in the checkpoint folder.
700
+
701
+ A typical usage might be ::
702
+
703
+ from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
704
+ model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
705
+ # submit to model hub or save the model to share with others
706
+
707
+ Note, that once this was run, the ``model`` will no longer be usable in the deepspeed context
708
+ of the same application. i.e. you will need to re-initialize the deepspeed engine, since
709
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
710
+
711
+ """
712
+ logger.info(f"Extracting fp32 weights")
713
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
714
+
715
+ logger.info(f"Overwriting model with fp32 weights")
716
+ model = model.cpu()
717
+ model.load_state_dict(state_dict, strict=False)
718
+
719
+ return model
720
+
721
+
722
+ if __name__ == "__main__":
723
+ parser = argparse.ArgumentParser()
724
+ parser.add_argument("checkpoint_dir",
725
+ type=str,
726
+ help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
727
+ parser.add_argument("output_dir",
728
+ type=str,
729
+ help="directory to the pytorch fp32 state_dict output files"
730
+ "(e.g. path/checkpoint-12-output/)")
731
+ parser.add_argument(
732
+ "--max_shard_size",
733
+ type=str,
734
+ default="5GB",
735
+ help="The maximum size for a checkpoint before being sharded. Checkpoints shard will then be each of size"
736
+ "lower than this size. If expressed as a string, needs to be digits followed by a unit (like `5MB`"
737
+ "We default it to 5GB in order for models to be able to run easily on free-tier google colab instances"
738
+ "without CPU OOM issues.")
739
+ parser.add_argument(
740
+ "--safe_serialization",
741
+ default=False,
742
+ action='store_true',
743
+ help="Whether to save the model using `safetensors` or the traditional PyTorch way (that uses `pickle`).")
744
+ parser.add_argument("-t",
745
+ "--tag",
746
+ type=str,
747
+ default=None,
748
+ help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
749
+ parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
750
+ parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
751
+ args = parser.parse_args()
752
+
753
+ debug = args.debug
754
+
755
+ convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
756
+ args.output_dir,
757
+ max_shard_size=args.max_shard_size,
758
+ safe_serialization=args.safe_serialization,
759
+ tag=args.tag,
760
+ exclude_frozen_parameters=args.exclude_frozen_parameters)