lalalaDa committed
Commit 014bb1e · verified · 1 Parent(s): c2b5586

Model save

README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ library_name: transformers
+ model_name: ER-GRPO-alpha99
+ tags:
+ - generated_from_trainer
+ - trl
+ - grpo
+ licence: license
+ ---
+
+ # Model Card for ER-GRPO-alpha99
+
+ This model is a fine-tuned version of [None](https://huggingface.co/None).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="lalalaDa/ER-GRPO-alpha99", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+
+
+
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
+
+ ### Framework versions
+
+ - TRL: 0.18.1
+ - Transformers: 4.52.4
+ - Pytorch: 2.5.1
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.1
+
+ ## Citations
+
+ Cite GRPO as:
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+ title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+ author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+ year = 2024,
+ eprint = {arXiv:2402.03300},
+ }
+
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
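The model card in this diff states that the model was trained with GRPO. As a rough illustration only, the sketch below computes the group-relative advantage at the core of that method: each completion's reward is normalized by the mean and standard deviation of the rewards sampled for the same prompt. The function name `group_relative_advantages`, the `eps` value, and the choice of the population standard deviation are assumptions made for this example; this is not the code that produced the commit, and it is not TRL's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completion group.

    rewards: list of scalar rewards, one per sampled completion.
    eps: small constant (assumed value) that avoids division by zero
         when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions for a single prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every reward in a group is identical, all advantages are zero and the group contributes no learning signal, which is presumably what the `frac_reward_zero_std` field logged in `trainer_state.json` keeps track of.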
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "total_flos": 0.0,
+   "train_loss": 0.00044901110231876373,
+   "train_runtime": 4526.0548,
+   "train_samples": 7000,
+   "train_samples_per_second": 0.53,
+   "train_steps_per_second": 0.011
+ }
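The throughput figures in `all_results.json` can be cross-checked against the `global_step` of 50 recorded in `trainer_state.json` from the same commit. The snippet below is a small sanity check, not part of the commit; the variable names are illustrative, and the rounded rates mean the results are only approximate.

```python
# Values copied from all_results.json in this commit.
train_runtime = 4526.0548      # seconds
steps_per_second = 0.011       # rounded in the file
samples_per_second = 0.53      # rounded in the file

# Value copied from trainer_state.json in this commit.
global_step = 50

# Reconstruct totals from the rounded rates.
implied_steps = train_runtime * steps_per_second
implied_samples = train_runtime * samples_per_second

print(f"runtime in hours:        {train_runtime / 3600:.2f}")
print(f"implied optimizer steps: {implied_steps:.1f}")
print(f"implied samples seen:    {implied_samples:.0f}")
print(f"steps/s from global_step: {global_step / train_runtime:.3f}")
```

The implied step count rounds to the logged 50 steps, and the implied sample count works out to roughly 48 samples per step, so the reported rates are consistent with one another.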
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 151646,
+   "do_sample": true,
+   "eos_token_id": 151643,
+   "temperature": 0.6,
+   "top_p": 0.95,
+   "transformers_version": "4.52.4"
+ }
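The generation config enables sampling with `temperature` 0.6 and `top_p` 0.95. To make those two settings concrete, here is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy logit vector. The function name `top_p_filter` and the example logits are hypothetical, and this is an illustration of the general technique rather than the Transformers implementation.

```python
import math


def top_p_filter(logits, temperature=0.6, top_p=0.95):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Toy example: four candidate tokens with made-up logits.
filtered = top_p_filter([2.0, 1.0, 0.0, -5.0])
print(filtered)
```

A sampler would then draw the next token from the renormalized distribution, for instance with `random.choices(list(filtered), weights=list(filtered.values()))`.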
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "total_flos": 0.0,
+   "train_loss": 0.00044901110231876373,
+   "train_runtime": 4526.0548,
+   "train_samples": 7000,
+   "train_samples_per_second": 0.53,
+   "train_steps_per_second": 0.011
+ }
trainer_state.json ADDED
@@ -0,0 +1,1493 @@
+ {
+ "best_global_step": null,
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 0.05714285714285714,
+ "eval_steps": 500,
+ "global_step": 50,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
13
+ "clip_ratio/high_max": 0.0,
14
+ "clip_ratio/high_mean": 0.0,
15
+ "clip_ratio/low_mean": 0.0,
16
+ "clip_ratio/low_min": 0.0,
17
+ "clip_ratio/region_mean": 0.0,
18
+ "completions/clipped_ratio": 0.5208333333333333,
19
+ "completions/max_length": 3584.0,
20
+ "completions/max_terminated_length": 3128.0,
21
+ "completions/mean_length": 2584.104248046875,
22
+ "completions/mean_terminated_length": 1497.2608642578125,
23
+ "completions/min_length": 557.0,
24
+ "completions/min_terminated_length": 557.0,
25
+ "epoch": 0.001142857142857143,
26
+ "frac_reward_zero_std": 0.0,
27
+ "grad_norm": 0.26198074221611023,
28
+ "kl": 0.0,
29
+ "learning_rate": 0.0,
30
+ "loss": -0.0022,
31
+ "num_tokens": 131153.0,
32
+ "policy_entropy_avg": 8.125,
33
+ "reward": 0.3948305547237396,
34
+ "reward_std": 0.7732391357421875,
35
+ "rewards/cosine_scaled_reward/mean": -0.062009382992982864,
36
+ "rewards/cosine_scaled_reward/std": 0.43048128485679626,
37
+ "rewards/format_reward/mean": 0.5208333134651184,
38
+ "rewards/format_reward/std": 0.504852294921875,
39
+ "step": 1
40
+ },
41
+ {
42
+ "clip_ratio/high_max": 0.0,
43
+ "clip_ratio/high_mean": 0.0,
44
+ "clip_ratio/low_mean": 0.0,
45
+ "clip_ratio/low_min": 0.0,
46
+ "clip_ratio/region_mean": 0.0,
47
+ "completions/clipped_ratio": 0.5833333333333333,
48
+ "completions/max_length": 3584.0,
49
+ "completions/max_terminated_length": 3280.0,
50
+ "completions/mean_length": 2761.666748046875,
51
+ "completions/mean_terminated_length": 1610.4000244140625,
52
+ "completions/min_length": 465.0,
53
+ "completions/min_terminated_length": 465.0,
54
+ "epoch": 0.002285714285714286,
55
+ "frac_reward_zero_std": 0.0,
56
+ "grad_norm": 0.2314005047082901,
57
+ "kl": 0.0,
58
+ "learning_rate": 2e-07,
59
+ "loss": -0.0045,
60
+ "num_tokens": 271243.0,
61
+ "policy_entropy_avg": 8.125,
62
+ "reward": 0.4077601432800293,
63
+ "reward_std": 0.8425893187522888,
64
+ "rewards/cosine_scaled_reward/mean": -0.003428752301260829,
65
+ "rewards/cosine_scaled_reward/std": 0.4935320317745209,
66
+ "rewards/format_reward/mean": 0.4166666567325592,
67
+ "rewards/format_reward/std": 0.49822381138801575,
68
+ "step": 2
69
+ },
70
+ {
71
+ "clip_ratio/high_max": 0.0,
72
+ "clip_ratio/high_mean": 0.0,
73
+ "clip_ratio/low_mean": 0.0,
74
+ "clip_ratio/low_min": 0.0,
75
+ "clip_ratio/region_mean": 0.0,
76
+ "completions/clipped_ratio": 0.875,
77
+ "completions/max_length": 3584.0,
78
+ "completions/max_terminated_length": 2945.0,
79
+ "completions/mean_length": 3343.33349609375,
80
+ "completions/mean_terminated_length": 1658.666748046875,
81
+ "completions/min_length": 490.0,
82
+ "completions/min_terminated_length": 490.0,
83
+ "epoch": 0.0034285714285714284,
84
+ "frac_reward_zero_std": 0.0,
85
+ "grad_norm": 0.19728681445121765,
86
+ "kl": 0.0006656646728515625,
87
+ "learning_rate": 4e-07,
88
+ "loss": 0.0095,
89
+ "num_tokens": 439577.0,
90
+ "policy_entropy_avg": 8.125,
91
+ "reward": -0.15455231070518494,
92
+ "reward_std": 0.5764515995979309,
93
+ "rewards/cosine_scaled_reward/mean": -0.17141447961330414,
94
+ "rewards/cosine_scaled_reward/std": 0.32203689217567444,
95
+ "rewards/format_reward/mean": 0.1875,
96
+ "rewards/format_reward/std": 0.3944427967071533,
97
+ "step": 3
98
+ },
99
+ {
100
+ "clip_ratio/high_max": 0.0,
101
+ "clip_ratio/high_mean": 0.0,
102
+ "clip_ratio/low_mean": 0.0,
103
+ "clip_ratio/low_min": 0.0,
104
+ "clip_ratio/region_mean": 0.0,
105
+ "completions/clipped_ratio": 0.39583333333333337,
106
+ "completions/max_length": 3584.0,
107
+ "completions/max_terminated_length": 3458.0,
108
+ "completions/mean_length": 2226.89599609375,
109
+ "completions/mean_terminated_length": 1337.7586669921875,
110
+ "completions/min_length": 407.0,
111
+ "completions/min_terminated_length": 407.0,
112
+ "epoch": 0.004571428571428572,
113
+ "frac_reward_zero_std": 0.0,
114
+ "grad_norm": 0.2950705587863922,
115
+ "kl": 0.0006043116251627604,
116
+ "learning_rate": 6e-07,
117
+ "loss": -0.001,
118
+ "num_tokens": 553824.0,
119
+ "policy_entropy_avg": 8.125,
120
+ "reward": 0.4680083394050598,
121
+ "reward_std": 0.8357078433036804,
122
+ "rewards/cosine_scaled_reward/mean": -0.09815327078104019,
123
+ "rewards/cosine_scaled_reward/std": 0.399366170167923,
124
+ "rewards/format_reward/mean": 0.6666666865348816,
125
+ "rewards/format_reward/std": 0.47639307379722595,
126
+ "step": 4
127
+ },
128
+ {
129
+ "clip_ratio/high_max": 0.0,
130
+ "clip_ratio/high_mean": 0.0,
131
+ "clip_ratio/low_mean": 0.0,
132
+ "clip_ratio/low_min": 0.0,
133
+ "clip_ratio/region_mean": 0.0,
134
+ "completions/clipped_ratio": 0.7083333333333333,
135
+ "completions/max_length": 3584.0,
136
+ "completions/max_terminated_length": 2603.0,
137
+ "completions/mean_length": 3089.104248046875,
138
+ "completions/mean_terminated_length": 1887.21435546875,
139
+ "completions/min_length": 909.0,
140
+ "completions/min_terminated_length": 909.0,
141
+ "epoch": 0.005714285714285714,
142
+ "frac_reward_zero_std": 0.0,
143
+ "grad_norm": 0.2482743263244629,
144
+ "kl": 0.000629425048828125,
145
+ "learning_rate": 8e-07,
146
+ "loss": 0.0028,
147
+ "num_tokens": 710213.0,
148
+ "policy_entropy_avg": 8.125,
149
+ "reward": -0.06966459006071091,
150
+ "reward_std": 0.7608852386474609,
151
+ "rewards/cosine_scaled_reward/mean": -0.20167399942874908,
152
+ "rewards/cosine_scaled_reward/std": 0.3204644024372101,
153
+ "rewards/format_reward/mean": 0.3333333432674408,
154
+ "rewards/format_reward/std": 0.47639307379722595,
155
+ "step": 5
156
+ },
157
+ {
158
+ "clip_ratio/high_max": 0.0,
159
+ "clip_ratio/high_mean": 0.0,
160
+ "clip_ratio/low_mean": 0.0,
161
+ "clip_ratio/low_min": 0.0,
162
+ "clip_ratio/region_mean": 0.0,
163
+ "completions/clipped_ratio": 0.7916666666666666,
164
+ "completions/max_length": 3584.0,
165
+ "completions/max_terminated_length": 3421.0,
166
+ "completions/mean_length": 3119.52099609375,
167
+ "completions/mean_terminated_length": 1354.5,
168
+ "completions/min_length": 554.0,
169
+ "completions/min_terminated_length": 554.0,
170
+ "epoch": 0.006857142857142857,
171
+ "frac_reward_zero_std": 0.0,
172
+ "grad_norm": 0.24190759658813477,
173
+ "kl": 0.0006701151529947916,
174
+ "learning_rate": 1e-06,
175
+ "loss": 0.0018,
176
+ "num_tokens": 868686.0,
177
+ "policy_entropy_avg": 8.125,
178
+ "reward": 0.026811789721250534,
179
+ "reward_std": 0.7506579756736755,
180
+ "rewards/cosine_scaled_reward/mean": -0.1427767425775528,
181
+ "rewards/cosine_scaled_reward/std": 0.3361252248287201,
182
+ "rewards/format_reward/mean": 0.3125,
183
+ "rewards/format_reward/std": 0.4684174358844757,
184
+ "step": 6
185
+ },
186
+ {
187
+ "clip_ratio/high_max": 0.0,
188
+ "clip_ratio/high_mean": 0.0,
189
+ "clip_ratio/low_mean": 0.0,
190
+ "clip_ratio/low_min": 0.0,
191
+ "clip_ratio/region_mean": 0.0,
192
+ "completions/clipped_ratio": 0.5416666666666667,
193
+ "completions/max_length": 3584.0,
194
+ "completions/max_terminated_length": 3457.0,
195
+ "completions/mean_length": 3024.291748046875,
196
+ "completions/mean_terminated_length": 2362.818359375,
197
+ "completions/min_length": 839.0,
198
+ "completions/min_terminated_length": 839.0,
199
+ "epoch": 0.008,
200
+ "frac_reward_zero_std": 0.0,
201
+ "grad_norm": 0.20822857320308685,
202
+ "kl": 0.0005512237548828125,
203
+ "learning_rate": 9.989038226169207e-07,
204
+ "loss": -0.009,
205
+ "num_tokens": 1021658.0,
206
+ "policy_entropy_avg": 8.125,
207
+ "reward": 0.47669005393981934,
208
+ "reward_std": 0.9081848859786987,
209
+ "rewards/cosine_scaled_reward/mean": -0.031290601938962936,
210
+ "rewards/cosine_scaled_reward/std": 0.47983497381210327,
211
+ "rewards/format_reward/mean": 0.5416666865348816,
212
+ "rewards/format_reward/std": 0.5035336017608643,
213
+ "step": 7
214
+ },
215
+ {
216
+ "clip_ratio/high_max": 0.0,
217
+ "clip_ratio/high_mean": 0.0,
218
+ "clip_ratio/low_mean": 0.0,
219
+ "clip_ratio/low_min": 0.0,
220
+ "clip_ratio/region_mean": 0.0,
221
+ "completions/clipped_ratio": 0.6041666666666667,
222
+ "completions/max_length": 3584.0,
223
+ "completions/max_terminated_length": 3568.0,
224
+ "completions/mean_length": 2791.875,
225
+ "completions/mean_terminated_length": 1582.8421630859375,
226
+ "completions/min_length": 327.0,
227
+ "completions/min_terminated_length": 327.0,
228
+ "epoch": 0.009142857142857144,
229
+ "frac_reward_zero_std": 0.0,
230
+ "grad_norm": 0.23235374689102173,
231
+ "kl": 0.0005970001220703125,
232
+ "learning_rate": 9.956206309337066e-07,
233
+ "loss": -0.0081,
234
+ "num_tokens": 1163480.0,
235
+ "policy_entropy_avg": 8.125,
236
+ "reward": 0.5300650596618652,
237
+ "reward_std": 0.7924127578735352,
238
+ "rewards/cosine_scaled_reward/mean": 0.03719766065478325,
239
+ "rewards/cosine_scaled_reward/std": 0.4377634525299072,
240
+ "rewards/format_reward/mean": 0.4583333432674408,
241
+ "rewards/format_reward/std": 0.5035336017608643,
242
+ "step": 8
243
+ },
244
+ {
245
+ "clip_ratio/high_max": 0.0,
246
+ "clip_ratio/high_mean": 0.0,
247
+ "clip_ratio/low_mean": 0.0,
248
+ "clip_ratio/low_min": 0.0,
249
+ "clip_ratio/region_mean": 0.0,
250
+ "completions/clipped_ratio": 0.7083333333333333,
251
+ "completions/max_length": 3584.0,
252
+ "completions/max_terminated_length": 3494.0,
253
+ "completions/mean_length": 3142.95849609375,
254
+ "completions/mean_terminated_length": 2071.857177734375,
255
+ "completions/min_length": 955.0,
256
+ "completions/min_terminated_length": 955.0,
257
+ "epoch": 0.010285714285714285,
258
+ "frac_reward_zero_std": 0.0,
259
+ "grad_norm": 0.21945622563362122,
260
+ "kl": 0.0006663004557291666,
261
+ "learning_rate": 9.901664203302124e-07,
262
+ "loss": 0.002,
263
+ "num_tokens": 1322934.0,
264
+ "policy_entropy_avg": 8.125,
265
+ "reward": 0.09029825031757355,
266
+ "reward_std": 0.8250617980957031,
267
+ "rewards/cosine_scaled_reward/mean": -0.1421239972114563,
268
+ "rewards/cosine_scaled_reward/std": 0.3718816637992859,
269
+ "rewards/format_reward/mean": 0.375,
270
+ "rewards/format_reward/std": 0.48924607038497925,
271
+ "step": 9
272
+ },
273
+ {
274
+ "clip_ratio/high_max": 0.0,
275
+ "clip_ratio/high_mean": 0.0,
276
+ "clip_ratio/low_mean": 0.0,
277
+ "clip_ratio/low_min": 0.0,
278
+ "clip_ratio/region_mean": 0.0,
279
+ "completions/clipped_ratio": 0.625,
280
+ "completions/max_length": 3584.0,
281
+ "completions/max_terminated_length": 3440.0,
282
+ "completions/mean_length": 2639.791748046875,
283
+ "completions/mean_terminated_length": 1066.111083984375,
284
+ "completions/min_length": 329.0,
285
+ "completions/min_terminated_length": 329.0,
286
+ "epoch": 0.011428571428571429,
287
+ "frac_reward_zero_std": 0.0,
288
+ "grad_norm": 0.2775964140892029,
289
+ "kl": 0.0005779266357421875,
290
+ "learning_rate": 9.825677631722435e-07,
291
+ "loss": -0.0111,
292
+ "num_tokens": 1457768.0,
293
+ "policy_entropy_avg": 8.125,
294
+ "reward": 0.31791985034942627,
295
+ "reward_std": 0.7219366431236267,
296
+ "rewards/cosine_scaled_reward/mean": -0.03815798461437225,
297
+ "rewards/cosine_scaled_reward/std": 0.4010634124279022,
298
+ "rewards/format_reward/mean": 0.3958333432674408,
299
+ "rewards/format_reward/std": 0.49420398473739624,
300
+ "step": 10
301
+ },
302
+ {
303
+ "clip_ratio/high_max": 0.0,
304
+ "clip_ratio/high_mean": 0.0,
305
+ "clip_ratio/low_mean": 0.0,
306
+ "clip_ratio/low_min": 0.0,
307
+ "clip_ratio/region_mean": 0.0,
308
+ "completions/clipped_ratio": 0.8125,
309
+ "completions/max_length": 3584.0,
310
+ "completions/max_terminated_length": 3528.0,
311
+ "completions/mean_length": 3260.8125,
312
+ "completions/mean_terminated_length": 1860.3333740234375,
313
+ "completions/min_length": 855.0,
314
+ "completions/min_terminated_length": 855.0,
315
+ "epoch": 0.012571428571428572,
316
+ "frac_reward_zero_std": 0.0,
317
+ "grad_norm": 0.21065565943717957,
318
+ "kl": 0.0005486806233723959,
319
+ "learning_rate": 9.728616793536587e-07,
320
+ "loss": 0.0145,
321
+ "num_tokens": 1623041.0,
322
+ "policy_entropy_avg": 8.135416666666666,
323
+ "reward": -0.1468753218650818,
324
+ "reward_std": 0.909512996673584,
325
+ "rewards/cosine_scaled_reward/mean": -0.18839001655578613,
326
+ "rewards/cosine_scaled_reward/std": 0.377286821603775,
327
+ "rewards/format_reward/mean": 0.2291666716337204,
328
+ "rewards/format_reward/std": 0.4247443675994873,
329
+ "step": 11
330
+ },
331
+ {
332
+ "clip_ratio/high_max": 0.0,
333
+ "clip_ratio/high_mean": 0.0,
334
+ "clip_ratio/low_mean": 0.0,
335
+ "clip_ratio/low_min": 0.0,
336
+ "clip_ratio/region_mean": 0.0,
337
+ "completions/clipped_ratio": 0.41666666666666663,
338
+ "completions/max_length": 3584.0,
339
+ "completions/max_terminated_length": 3564.0,
340
+ "completions/mean_length": 2480.791748046875,
341
+ "completions/mean_terminated_length": 1692.7857666015625,
342
+ "completions/min_length": 474.0,
343
+ "completions/min_terminated_length": 474.0,
344
+ "epoch": 0.013714285714285714,
345
+ "frac_reward_zero_std": 0.0,
346
+ "grad_norm": 0.33562034368515015,
347
+ "kl": 0.0005970001220703125,
348
+ "learning_rate": 9.610954559391704e-07,
349
+ "loss": 0.0138,
350
+ "num_tokens": 1750327.0,
351
+ "policy_entropy_avg": 8.125,
352
+ "reward": 0.425642192363739,
353
+ "reward_std": 0.823100745677948,
354
+ "rewards/cosine_scaled_reward/mean": -0.08819279819726944,
355
+ "rewards/cosine_scaled_reward/std": 0.44234269857406616,
356
+ "rewards/format_reward/mean": 0.6041666865348816,
357
+ "rewards/format_reward/std": 0.49420398473739624,
358
+ "step": 12
359
+ },
360
+ {
361
+ "clip_ratio/high_max": 0.0,
362
+ "clip_ratio/high_mean": 0.0,
363
+ "clip_ratio/low_mean": 0.0,
364
+ "clip_ratio/low_min": 0.0,
365
+ "clip_ratio/region_mean": 0.0,
366
+ "completions/clipped_ratio": 0.5833333333333333,
367
+ "completions/max_length": 3584.0,
368
+ "completions/max_terminated_length": 3538.0,
369
+ "completions/mean_length": 2816.14599609375,
370
+ "completions/mean_terminated_length": 1741.1500244140625,
371
+ "completions/min_length": 452.0,
372
+ "completions/min_terminated_length": 452.0,
373
+ "epoch": 0.014857142857142857,
374
+ "frac_reward_zero_std": 0.0,
375
+ "grad_norm": 0.27906712889671326,
376
+ "kl": 0.0005658467610677084,
377
+ "learning_rate": 9.473264167865171e-07,
378
+ "loss": 0.0016,
379
+ "num_tokens": 1893782.0,
380
+ "policy_entropy_avg": 8.125,
381
+ "reward": 0.24736499786376953,
382
+ "reward_std": 0.7155070304870605,
383
+ "rewards/cosine_scaled_reward/mean": -0.09444598108530045,
384
+ "rewards/cosine_scaled_reward/std": 0.4492030441761017,
385
+ "rewards/format_reward/mean": 0.4375,
386
+ "rewards/format_reward/std": 0.5013279914855957,
387
+ "step": 13
388
+ },
389
+ {
390
+ "clip_ratio/high_max": 0.0,
391
+ "clip_ratio/high_mean": 0.0,
392
+ "clip_ratio/low_mean": 0.0,
393
+ "clip_ratio/low_min": 0.0,
394
+ "clip_ratio/region_mean": 0.0,
395
+ "completions/clipped_ratio": 0.5833333333333333,
396
+ "completions/max_length": 3584.0,
397
+ "completions/max_terminated_length": 3369.0,
398
+ "completions/mean_length": 2769.0,
399
+ "completions/mean_terminated_length": 1628.0,
400
+ "completions/min_length": 555.0,
401
+ "completions/min_terminated_length": 555.0,
402
+ "epoch": 0.016,
403
+ "frac_reward_zero_std": 0.0,
404
+ "grad_norm": 0.27333828806877136,
405
+ "kl": 0.000553131103515625,
406
+ "learning_rate": 9.316216432703916e-07,
407
+ "loss": 0.0017,
408
+ "num_tokens": 2034650.0,
409
+ "policy_entropy_avg": 8.125,
410
+ "reward": 0.1486106812953949,
411
+ "reward_std": 0.8035473227500916,
412
+ "rewards/cosine_scaled_reward/mean": -0.1336546093225479,
413
+ "rewards/cosine_scaled_reward/std": 0.3953794538974762,
414
+ "rewards/format_reward/mean": 0.4166666567325592,
415
+ "rewards/format_reward/std": 0.49822381138801575,
416
+ "step": 14
417
+ },
418
+ {
419
+ "clip_ratio/high_max": 0.0,
420
+ "clip_ratio/high_mean": 0.0,
421
+ "clip_ratio/low_mean": 0.0,
422
+ "clip_ratio/low_min": 0.0,
423
+ "clip_ratio/region_mean": 0.0,
424
+ "completions/clipped_ratio": 0.625,
425
+ "completions/max_length": 3584.0,
426
+ "completions/max_terminated_length": 3186.0,
427
+ "completions/mean_length": 2703.08349609375,
428
+ "completions/mean_terminated_length": 1234.888916015625,
429
+ "completions/min_length": 405.0,
430
+ "completions/min_terminated_length": 405.0,
431
+ "epoch": 0.017142857142857144,
432
+ "frac_reward_zero_std": 0.0,
433
+ "grad_norm": 0.2734379768371582,
434
+ "kl": 0.0005137125651041666,
435
+ "learning_rate": 9.140576474687263e-07,
436
+ "loss": -0.0122,
437
+ "num_tokens": 2172588.0,
438
+ "policy_entropy_avg": 8.125,
439
+ "reward": 0.406665563583374,
440
+ "reward_std": 0.3276861608028412,
441
+ "rewards/cosine_scaled_reward/mean": 0.0168545451015234,
442
+ "rewards/cosine_scaled_reward/std": 0.4574853479862213,
443
+ "rewards/format_reward/mean": 0.375,
444
+ "rewards/format_reward/std": 0.48924607038497925,
445
+ "step": 15
446
+ },
447
+ {
448
+ "clip_ratio/high_max": 0.0,
449
+ "clip_ratio/high_mean": 0.0,
450
+ "clip_ratio/low_mean": 0.0,
451
+ "clip_ratio/low_min": 0.0,
452
+ "clip_ratio/region_mean": 0.0,
453
+ "completions/clipped_ratio": 0.9791666666666666,
454
+ "completions/max_length": 3584.0,
455
+ "completions/max_terminated_length": 2984.0,
456
+ "completions/mean_length": 3571.5,
457
+ "completions/mean_terminated_length": 2984.0,
458
+ "completions/min_length": 2984.0,
459
+ "completions/min_terminated_length": 2984.0,
460
+ "epoch": 0.018285714285714287,
461
+ "frac_reward_zero_std": 0.0,
462
+ "grad_norm": 0.1932428628206253,
463
+ "kl": 0.0006643931070963541,
464
+ "learning_rate": 8.9471999940354e-07,
465
+ "loss": 0.0075,
466
+ "num_tokens": 2351850.0,
467
+ "policy_entropy_avg": 8.135416666666666,
468
+ "reward": -0.3992506265640259,
469
+ "reward_std": 0.5042399168014526,
470
+ "rewards/cosine_scaled_reward/mean": -0.22146178781986237,
471
+ "rewards/cosine_scaled_reward/std": 0.292772501707077,
472
+ "rewards/format_reward/mean": 0.0416666679084301,
473
+ "rewards/format_reward/std": 0.20194092392921448,
474
+ "step": 16
475
+ },
476
+ {
477
+ "clip_ratio/high_max": 0.0,
478
+ "clip_ratio/high_mean": 0.0,
479
+ "clip_ratio/low_mean": 0.0,
480
+ "clip_ratio/low_min": 0.0,
481
+ "clip_ratio/region_mean": 0.0,
482
+ "completions/clipped_ratio": 0.39583333333333337,
483
+ "completions/max_length": 3584.0,
484
+ "completions/max_terminated_length": 3475.0,
485
+ "completions/mean_length": 2287.416748046875,
486
+ "completions/mean_terminated_length": 1437.9310302734375,
487
+ "completions/min_length": 364.0,
488
+ "completions/min_terminated_length": 364.0,
489
+ "epoch": 0.019428571428571427,
490
+ "frac_reward_zero_std": 0.0,
491
+ "grad_norm": 0.37555888295173645,
492
+ "kl": 0.0006338755289713541,
493
+ "learning_rate": 8.737029101523929e-07,
494
+ "loss": -0.0011,
495
+ "num_tokens": 2469536.0,
496
+ "policy_entropy_avg": 8.125,
497
+ "reward": 0.5107091665267944,
498
+ "reward_std": 0.8238445520401001,
499
+ "rewards/cosine_scaled_reward/mean": -0.04544559493660927,
500
+ "rewards/cosine_scaled_reward/std": 0.45671001076698303,
501
+ "rewards/format_reward/mean": 0.6041666865348816,
502
+ "rewards/format_reward/std": 0.49420398473739624,
503
+ "step": 17
504
+ },
505
+ {
506
+ "clip_ratio/high_max": 0.0,
507
+ "clip_ratio/high_mean": 0.0,
508
+ "clip_ratio/low_mean": 0.0,
509
+ "clip_ratio/low_min": 0.0,
510
+ "clip_ratio/region_mean": 0.0,
511
+ "completions/clipped_ratio": 0.6875,
512
+ "completions/max_length": 3584.0,
513
+ "completions/max_terminated_length": 3116.0,
514
+ "completions/mean_length": 2911.89599609375,
515
+ "completions/mean_terminated_length": 1433.2667236328125,
516
+ "completions/min_length": 608.0,
517
+ "completions/min_terminated_length": 608.0,
518
+ "epoch": 0.02057142857142857,
519
+ "frac_reward_zero_std": 0.0,
520
+ "grad_norm": 0.21075770258903503,
521
+ "kl": 0.0006434122721354166,
522
+ "learning_rate": 8.511087728614862e-07,
523
+ "loss": 0.0029,
524
+ "num_tokens": 2617089.0,
525
+ "policy_entropy_avg": 8.135416666666666,
526
+ "reward": -0.13376453518867493,
527
+ "reward_std": 0.6403241157531738,
528
+ "rewards/cosine_scaled_reward/mean": -0.2234683483839035,
529
+ "rewards/cosine_scaled_reward/std": 0.2743138074874878,
530
+ "rewards/format_reward/mean": 0.3125,
531
+ "rewards/format_reward/std": 0.4684174358844757,
532
+ "step": 18
533
+ },
534
+ {
535
+ "clip_ratio/high_max": 0.0,
536
+ "clip_ratio/high_mean": 0.0,
537
+ "clip_ratio/low_mean": 0.0,
538
+ "clip_ratio/low_min": 0.0,
539
+ "clip_ratio/region_mean": 0.0,
540
+ "completions/clipped_ratio": 0.5625,
541
+ "completions/max_length": 3584.0,
542
+ "completions/max_terminated_length": 3400.0,
543
+ "completions/mean_length": 2844.5,
544
+ "completions/mean_terminated_length": 1893.71435546875,
545
+ "completions/min_length": 504.0,
546
+ "completions/min_terminated_length": 504.0,
547
+ "epoch": 0.021714285714285714,
548
+ "frac_reward_zero_std": 0.0,
549
+ "grad_norm": 0.24317172169685364,
550
+ "kl": 0.0006122589111328125,
551
+ "learning_rate": 8.270476638965461e-07,
552
+ "loss": -0.0105,
553
+ "num_tokens": 2762067.0,
554
+ "policy_entropy_avg": 8.125,
555
+ "reward": 0.7856847643852234,
556
+ "reward_std": 0.5978894829750061,
557
+ "rewards/cosine_scaled_reward/mean": 0.15523308515548706,
558
+ "rewards/cosine_scaled_reward/std": 0.5373290181159973,
559
+ "rewards/format_reward/mean": 0.4791666567325592,
560
+ "rewards/format_reward/std": 0.5048523545265198,
561
+ "step": 19
562
+ },
563
+ {
564
+ "clip_ratio/high_max": 0.0,
565
+ "clip_ratio/high_mean": 0.0,
566
+ "clip_ratio/low_mean": 0.0,
567
+ "clip_ratio/low_min": 0.0,
568
+ "clip_ratio/region_mean": 0.0,
569
+ "completions/clipped_ratio": 0.47916666666666663,
570
+ "completions/max_length": 3584.0,
571
+ "completions/max_terminated_length": 3494.0,
572
+ "completions/mean_length": 2482.83349609375,
573
+ "completions/mean_terminated_length": 1469.760009765625,
574
+ "completions/min_length": 408.0,
575
+ "completions/min_terminated_length": 408.0,
576
+ "epoch": 0.022857142857142857,
577
+ "frac_reward_zero_std": 0.0,
578
+ "grad_norm": 0.26188376545906067,
579
+ "kl": 0.0005286534627278646,
580
+ "learning_rate": 8.01636806561836e-07,
581
+ "loss": -0.0042,
582
+ "num_tokens": 2889757.0,
583
+ "policy_entropy_avg": 8.125,
584
+ "reward": 0.5357545614242554,
585
+ "reward_std": 0.7750095129013062,
586
+ "rewards/cosine_scaled_reward/mean": -0.03285994753241539,
587
+ "rewards/cosine_scaled_reward/std": 0.4009867310523987,
588
+ "rewards/format_reward/mean": 0.6041666865348816,
589
+ "rewards/format_reward/std": 0.49420398473739624,
590
+ "step": 20
591
+ },
592
+ {
593
+ "clip_ratio/high_max": 0.0,
594
+ "clip_ratio/high_mean": 0.0,
595
+ "clip_ratio/low_mean": 0.0,
596
+ "clip_ratio/low_min": 0.0,
597
+ "clip_ratio/region_mean": 0.0,
598
+ "completions/clipped_ratio": 0.6041666666666667,
599
+ "completions/max_length": 3584.0,
600
+ "completions/max_terminated_length": 2710.0,
601
+ "completions/mean_length": 2631.70849609375,
602
+ "completions/mean_terminated_length": 1178.2105712890625,
603
+ "completions/min_length": 342.0,
604
+ "completions/min_terminated_length": 342.0,
605
+ "epoch": 0.024,
606
+ "frac_reward_zero_std": 0.0,
607
+ "grad_norm": 0.31193122267723083,
608
+ "kl": 0.0006847381591796875,
609
+ "learning_rate": 7.75e-07,
610
+ "loss": 0.0002,
611
+ "num_tokens": 3024185.0,
612
+ "policy_entropy_avg": 8.125,
613
+ "reward": 0.18238481879234314,
614
+ "reward_std": 0.4078831374645233,
615
+ "rewards/cosine_scaled_reward/mean": -0.11668267101049423,
616
+ "rewards/cosine_scaled_reward/std": 0.3962862193584442,
617
+ "rewards/format_reward/mean": 0.4166666567325592,
618
+ "rewards/format_reward/std": 0.49822381138801575,
619
+ "step": 21
620
+ },
621
+ {
622
+ "clip_ratio/high_max": 0.0,
623
+ "clip_ratio/high_mean": 0.0,
624
+ "clip_ratio/low_mean": 0.0,
625
+ "clip_ratio/low_min": 0.0,
626
+ "clip_ratio/region_mean": 0.0,
627
+ "completions/clipped_ratio": 0.27083333333333337,
628
+ "completions/max_length": 3584.0,
629
+ "completions/max_terminated_length": 3239.0,
630
+ "completions/mean_length": 1697.2083740234375,
631
+ "completions/mean_terminated_length": 996.4000244140625,
632
+ "completions/min_length": 250.0,
633
+ "completions/min_terminated_length": 250.0,
634
+ "epoch": 0.025142857142857144,
635
+ "frac_reward_zero_std": 0.0,
636
+ "grad_norm": 0.40179967880249023,
637
+ "kl": 0.0006052652994791666,
638
+ "learning_rate": 7.472670160550848e-07,
639
+ "loss": -0.005,
640
+ "num_tokens": 3112413.0,
641
+ "policy_entropy_avg": 8.125,
642
+ "reward": 0.6634411811828613,
643
+ "reward_std": 0.5782728791236877,
644
+ "rewards/cosine_scaled_reward/mean": -0.06244581937789917,
645
+ "rewards/cosine_scaled_reward/std": 0.4282727539539337,
646
+ "rewards/format_reward/mean": 0.7916666865348816,
647
+ "rewards/format_reward/std": 0.41041406989097595,
648
+ "step": 22
649
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.39583333333333337,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3421.0,
+ "completions/mean_length": 2181.104248046875,
+ "completions/mean_terminated_length": 1261.9654541015625,
+ "completions/min_length": 595.0,
+ "completions/min_terminated_length": 595.0,
+ "epoch": 0.026285714285714287,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.3063512444496155,
+ "kl": 0.0006097157796223959,
+ "learning_rate": 7.185729670371604e-07,
+ "loss": 0.0002,
+ "num_tokens": 3225200.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.3287263512611389,
+ "reward_std": 0.8908068537712097,
+ "rewards/cosine_scaled_reward/mean": -0.14731089770793915,
+ "rewards/cosine_scaled_reward/std": 0.42148637771606445,
+ "rewards/format_reward/mean": 0.625,
+ "rewards/format_reward/std": 0.48924607038497925,
+ "step": 23
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.5416666666666667,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3322.0,
+ "completions/mean_length": 2681.229248046875,
+ "completions/mean_terminated_length": 1614.3182373046875,
+ "completions/min_length": 461.0,
+ "completions/min_terminated_length": 461.0,
+ "epoch": 0.027428571428571427,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.26260870695114136,
+ "kl": 0.0006230672200520834,
+ "learning_rate": 6.890576474687263e-07,
+ "loss": 0.001,
+ "num_tokens": 3362095.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.3754323720932007,
+ "reward_std": 0.7894452810287476,
+ "rewards/cosine_scaled_reward/mean": -0.061340540647506714,
+ "rewards/cosine_scaled_reward/std": 0.4359513223171234,
+ "rewards/format_reward/mean": 0.5,
+ "rewards/format_reward/std": 0.5052911639213562,
+ "step": 24
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.5625,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3461.0,
+ "completions/mean_length": 2590.166748046875,
+ "completions/mean_terminated_length": 1312.3809814453125,
+ "completions/min_length": 532.0,
+ "completions/min_terminated_length": 532.0,
+ "epoch": 0.02857142857142857,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.23011414706707,
+ "kl": 0.0007470448811848959,
+ "learning_rate": 6.588648530198504e-07,
+ "loss": 0.0021,
+ "num_tokens": 3494145.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.3740345239639282,
+ "reward_std": 0.7695837020874023,
+ "rewards/cosine_scaled_reward/mean": -0.03079296462237835,
+ "rewards/cosine_scaled_reward/std": 0.44012707471847534,
+ "rewards/format_reward/mean": 0.4375,
+ "rewards/format_reward/std": 0.5013279914855957,
+ "step": 25
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.5625,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3395.0,
+ "completions/mean_length": 2929.666748046875,
+ "completions/mean_terminated_length": 2088.381103515625,
+ "completions/min_length": 879.0,
+ "completions/min_terminated_length": 879.0,
+ "epoch": 0.029714285714285714,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.23103339970111847,
+ "kl": 0.0006039937337239584,
+ "learning_rate": 6.281416799501187e-07,
+ "loss": 0.0037,
+ "num_tokens": 3642743.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.2619829773902893,
+ "reward_std": 0.6574144959449768,
+ "rewards/cosine_scaled_reward/mean": -0.10793358087539673,
+ "rewards/cosine_scaled_reward/std": 0.4338338077068329,
+ "rewards/format_reward/mean": 0.4791666567325592,
+ "rewards/format_reward/std": 0.5048523545265198,
+ "step": 26
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.6666666666666667,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3288.0,
+ "completions/mean_length": 2908.20849609375,
+ "completions/mean_terminated_length": 1556.625,
+ "completions/min_length": 518.0,
+ "completions/min_terminated_length": 518.0,
+ "epoch": 0.030857142857142857,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.2881726324558258,
+ "kl": 0.000667572021484375,
+ "learning_rate": 5.97037808470444e-07,
+ "loss": -0.0,
+ "num_tokens": 3790053.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.041869934648275375,
+ "reward_std": 0.7798717021942139,
+ "rewards/cosine_scaled_reward/mean": -0.1560431718826294,
+ "rewards/cosine_scaled_reward/std": 0.29862359166145325,
+ "rewards/format_reward/mean": 0.3541666567325592,
+ "rewards/format_reward/std": 0.4833211302757263,
+ "step": 27
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.6041666666666667,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 2977.0,
+ "completions/mean_length": 2831.52099609375,
+ "completions/mean_terminated_length": 1683.0,
+ "completions/min_length": 509.0,
+ "completions/min_terminated_length": 509.0,
+ "epoch": 0.032,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.24444623291492462,
+ "kl": 0.0005861918131510416,
+ "learning_rate": 5.657047735161255e-07,
+ "loss": 0.0062,
+ "num_tokens": 3933718.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.4484003484249115,
+ "reward_std": 0.8719537258148193,
+ "rewards/cosine_scaled_reward/mean": 0.006576786283403635,
+ "rewards/cosine_scaled_reward/std": 0.4855944514274597,
+ "rewards/format_reward/mean": 0.4375,
+ "rewards/format_reward/std": 0.5013279914855957,
+ "step": 28
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.7916666666666666,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3091.0,
+ "completions/mean_length": 3182.02099609375,
+ "completions/mean_terminated_length": 1654.5,
+ "completions/min_length": 418.0,
+ "completions/min_terminated_length": 418.0,
+ "epoch": 0.03314285714285714,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.22512345016002655,
+ "kl": 0.0006993611653645834,
+ "learning_rate": 5.342952264838747e-07,
+ "loss": 0.0099,
+ "num_tokens": 4094309.0,
+ "policy_entropy_avg": 8.125,
+ "reward": -0.1955069899559021,
+ "reward_std": 0.668115496635437,
+ "rewards/cosine_scaled_reward/mean": -0.21282805502414703,
+ "rewards/cosine_scaled_reward/std": 0.37752190232276917,
+ "rewards/format_reward/mean": 0.2291666716337204,
+ "rewards/format_reward/std": 0.4247443675994873,
+ "step": 29
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.5625,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3320.0,
+ "completions/mean_length": 2794.666748046875,
+ "completions/mean_terminated_length": 1779.8095703125,
+ "completions/min_length": 667.0,
+ "completions/min_terminated_length": 667.0,
+ "epoch": 0.03428571428571429,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.22448024153709412,
+ "kl": 0.0006434122721354166,
+ "learning_rate": 5.02962191529556e-07,
+ "loss": 0.0021,
+ "num_tokens": 4236355.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.38854461908340454,
+ "reward_std": 0.8984581828117371,
+ "rewards/cosine_scaled_reward/mean": -0.0443347692489624,
+ "rewards/cosine_scaled_reward/std": 0.44917213916778564,
+ "rewards/format_reward/mean": 0.4791666567325592,
+ "rewards/format_reward/std": 0.5048523545265198,
+ "step": 30
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.7708333333333334,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3570.0,
+ "completions/mean_length": 3039.39599609375,
+ "completions/mean_terminated_length": 1207.5455322265625,
+ "completions/min_length": 281.0,
+ "completions/min_terminated_length": 281.0,
+ "epoch": 0.03542857142857143,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.22645282745361328,
+ "kl": 0.0006421407063802084,
+ "learning_rate": 4.7185832004988133e-07,
+ "loss": 0.009,
+ "num_tokens": 4390118.0,
+ "policy_entropy_avg": 8.125,
+ "reward": -0.09765049070119858,
+ "reward_std": 0.6962664127349854,
+ "rewards/cosine_scaled_reward/mean": -0.1740705966949463,
+ "rewards/cosine_scaled_reward/std": 0.4055609405040741,
+ "rewards/format_reward/mean": 0.25,
+ "rewards/format_reward/std": 0.4375949800014496,
+ "step": 31
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.5833333333333333,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3517.0,
+ "completions/mean_length": 3097.125,
+ "completions/mean_terminated_length": 2415.5,
+ "completions/min_length": 1046.0,
+ "completions/min_terminated_length": 1046.0,
+ "epoch": 0.036571428571428574,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.19842347502708435,
+ "kl": 0.000629425048828125,
+ "learning_rate": 4.4113514698014953e-07,
+ "loss": -0.0137,
+ "num_tokens": 4546544.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.6700344085693359,
+ "reward_std": 0.7424625158309937,
+ "rewards/cosine_scaled_reward/mean": 0.10753399133682251,
+ "rewards/cosine_scaled_reward/std": 0.5346410274505615,
+ "rewards/format_reward/mean": 0.4583333432674408,
+ "rewards/format_reward/std": 0.5035336017608643,
+ "step": 32
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.75,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3484.0,
+ "completions/mean_length": 3236.604248046875,
+ "completions/mean_terminated_length": 2194.416748046875,
+ "completions/min_length": 1039.0,
+ "completions/min_terminated_length": 1039.0,
+ "epoch": 0.037714285714285714,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.19302453100681305,
+ "kl": 0.0005734761555989584,
+ "learning_rate": 4.1094235253127374e-07,
+ "loss": 0.0048,
+ "num_tokens": 4710313.0,
+ "policy_entropy_avg": 8.125,
+ "reward": -0.12012770771980286,
+ "reward_std": 0.6003495454788208,
+ "rewards/cosine_scaled_reward/mean": -0.1957823485136032,
+ "rewards/cosine_scaled_reward/std": 0.28730008006095886,
+ "rewards/format_reward/mean": 0.2708333432674408,
+ "rewards/format_reward/std": 0.4490928649902344,
+ "step": 33
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.45833333333333337,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3326.0,
+ "completions/mean_length": 2303.45849609375,
+ "completions/mean_terminated_length": 1219.923095703125,
+ "completions/min_length": 430.0,
+ "completions/min_terminated_length": 430.0,
+ "epoch": 0.038857142857142854,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.273879736661911,
+ "kl": 0.0007279713948567709,
+ "learning_rate": 3.8142703296283953e-07,
+ "loss": 0.0012,
+ "num_tokens": 4828043.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.5595396757125854,
+ "reward_std": 0.9288837313652039,
+ "rewards/cosine_scaled_reward/mean": -7.428725803038105e-05,
+ "rewards/cosine_scaled_reward/std": 0.5000401139259338,
+ "rewards/format_reward/mean": 0.5625,
+ "rewards/format_reward/std": 0.5013279914855957,
+ "step": 34
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.7291666666666667,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3498.0,
+ "completions/mean_length": 3022.52099609375,
+ "completions/mean_terminated_length": 1510.84619140625,
+ "completions/min_length": 417.0,
+ "completions/min_terminated_length": 417.0,
+ "epoch": 0.04,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.25011301040649414,
+ "kl": 0.0006783803304036459,
+ "learning_rate": 3.5273298394491515e-07,
+ "loss": 0.0054,
+ "num_tokens": 4981746.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.1530809849500656,
+ "reward_std": 0.9632071256637573,
+ "rewards/cosine_scaled_reward/mean": -0.07932490855455399,
+ "rewards/cosine_scaled_reward/std": 0.4641749858856201,
+ "rewards/format_reward/mean": 0.3125,
+ "rewards/format_reward/std": 0.4684174358844757,
+ "step": 35
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.8125,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3483.0,
+ "completions/mean_length": 3256.479248046875,
+ "completions/mean_terminated_length": 1837.2222900390625,
+ "completions/min_length": 1007.0,
+ "completions/min_terminated_length": 1007.0,
+ "epoch": 0.04114285714285714,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.21851785480976105,
+ "kl": 0.0007712046305338541,
+ "learning_rate": 3.250000000000001e-07,
+ "loss": 0.0035,
+ "num_tokens": 5146391.0,
+ "policy_entropy_avg": 8.135416666666666,
+ "reward": -0.2397136390209198,
+ "reward_std": 0.3913343846797943,
+ "rewards/cosine_scaled_reward/mean": -0.23504245281219482,
+ "rewards/cosine_scaled_reward/std": 0.17867261171340942,
+ "rewards/format_reward/mean": 0.2291666716337204,
+ "rewards/format_reward/std": 0.4247443675994873,
+ "step": 36
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.7708333333333334,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3151.0,
+ "completions/mean_length": 3157.70849609375,
+ "completions/mean_terminated_length": 1723.8182373046875,
+ "completions/min_length": 749.0,
+ "completions/min_terminated_length": 749.0,
+ "epoch": 0.04228571428571429,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.2363603562116623,
+ "kl": 0.0006186167399088541,
+ "learning_rate": 2.9836319343816397e-07,
+ "loss": 0.0063,
+ "num_tokens": 5306229.0,
+ "policy_entropy_avg": 8.125,
+ "reward": -0.23191750049591064,
+ "reward_std": 0.505261242389679,
+ "rewards/cosine_scaled_reward/mean": -0.24154144525527954,
+ "rewards/cosine_scaled_reward/std": 0.23630201816558838,
+ "rewards/format_reward/mean": 0.25,
+ "rewards/format_reward/std": 0.4375949800014496,
+ "step": 37
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.7916666666666666,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 2515.0,
+ "completions/mean_length": 3111.854248046875,
+ "completions/mean_terminated_length": 1317.7000732421875,
+ "completions/min_length": 679.0,
+ "completions/min_terminated_length": 679.0,
+ "epoch": 0.04342857142857143,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.19270718097686768,
+ "kl": 0.0006771087646484375,
+ "learning_rate": 2.729523361034538e-07,
+ "loss": -0.0004,
+ "num_tokens": 5464382.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.06942842155694962,
+ "reward_std": 0.4463602900505066,
+ "rewards/cosine_scaled_reward/mean": -0.07969469577074051,
+ "rewards/cosine_scaled_reward/std": 0.3691597282886505,
+ "rewards/format_reward/mean": 0.2291666716337204,
+ "rewards/format_reward/std": 0.4247443675994873,
+ "step": 38
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.5833333333333333,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3387.0,
+ "completions/mean_length": 2799.166748046875,
+ "completions/mean_terminated_length": 1700.4000244140625,
+ "completions/min_length": 262.0,
+ "completions/min_terminated_length": 262.0,
+ "epoch": 0.044571428571428574,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.2799771726131439,
+ "kl": 0.0005861918131510416,
+ "learning_rate": 2.488912271385139e-07,
+ "loss": -0.0356,
+ "num_tokens": 5606830.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.41357025504112244,
+ "reward_std": 0.3624642491340637,
+ "rewards/cosine_scaled_reward/mean": -0.04217575863003731,
+ "rewards/cosine_scaled_reward/std": 0.4245593845844269,
+ "rewards/format_reward/mean": 0.5,
+ "rewards/format_reward/std": 0.5052911639213562,
+ "step": 39
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.5416666666666667,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 2945.0,
+ "completions/mean_length": 2401.95849609375,
+ "completions/mean_terminated_length": 1005.0,
+ "completions/min_length": 494.0,
+ "completions/min_terminated_length": 494.0,
+ "epoch": 0.045714285714285714,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.2853446900844574,
+ "kl": 0.000675201416015625,
+ "learning_rate": 2.2629708984760706e-07,
+ "loss": -0.0033,
+ "num_tokens": 5729678.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.2566830515861511,
+ "reward_std": 0.573390781879425,
+ "rewards/cosine_scaled_reward/mean": -0.11059689521789551,
+ "rewards/cosine_scaled_reward/std": 0.43331247568130493,
+ "rewards/format_reward/mean": 0.4791666567325592,
+ "rewards/format_reward/std": 0.5048523545265198,
+ "step": 40
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.625,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3370.0,
+ "completions/mean_length": 2863.45849609375,
+ "completions/mean_terminated_length": 1662.5555419921875,
+ "completions/min_length": 762.0,
+ "completions/min_terminated_length": 762.0,
+ "epoch": 0.046857142857142854,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.22030523419380188,
+ "kl": 0.0006434122721354166,
+ "learning_rate": 2.0528000059645995e-07,
+ "loss": 0.0144,
+ "num_tokens": 5875488.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.050709377974271774,
+ "reward_std": 0.7291332483291626,
+ "rewards/cosine_scaled_reward/mean": -0.18285124003887177,
+ "rewards/cosine_scaled_reward/std": 0.3616393804550171,
+ "rewards/format_reward/mean": 0.4166666567325592,
+ "rewards/format_reward/std": 0.49822381138801575,
+ "step": 41
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.6875,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 2370.0,
+ "completions/mean_length": 2728.5,
+ "completions/mean_terminated_length": 846.4000244140625,
+ "completions/min_length": 207.0,
+ "completions/min_terminated_length": 207.0,
+ "epoch": 0.048,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.3805939257144928,
+ "kl": 0.0007483164469401041,
+ "learning_rate": 1.8594235253127372e-07,
+ "loss": 0.001,
+ "num_tokens": 6014226.0,
+ "policy_entropy_avg": 8.125,
+ "reward": -0.11857573688030243,
+ "reward_std": 0.35950538516044617,
+ "rewards/cosine_scaled_reward/mean": -0.2158358097076416,
+ "rewards/cosine_scaled_reward/std": 0.18257829546928406,
+ "rewards/format_reward/mean": 0.3125,
+ "rewards/format_reward/std": 0.4684174358844757,
+ "step": 42
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.75,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 2668.0,
+ "completions/mean_length": 3001.479248046875,
+ "completions/mean_terminated_length": 1253.916748046875,
+ "completions/min_length": 529.0,
+ "completions/min_terminated_length": 529.0,
+ "epoch": 0.04914285714285714,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.2410728931427002,
+ "kl": 0.0007006327311197916,
+ "learning_rate": 1.6837835672960831e-07,
+ "loss": 0.0025,
+ "num_tokens": 6167009.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.13230201601982117,
+ "reward_std": 0.6667492389678955,
+ "rewards/cosine_scaled_reward/mean": -0.05851660296320915,
+ "rewards/cosine_scaled_reward/std": 0.43021252751350403,
+ "rewards/format_reward/mean": 0.25,
+ "rewards/format_reward/std": 0.4375949800014496,
+ "step": 43
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.6041666666666667,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3552.0,
+ "completions/mean_length": 2645.104248046875,
+ "completions/mean_terminated_length": 1212.0526123046875,
+ "completions/min_length": 395.0,
+ "completions/min_terminated_length": 395.0,
+ "epoch": 0.05028571428571429,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.2952722907066345,
+ "kl": 0.0007654825846354166,
+ "learning_rate": 1.5267358321348285e-07,
+ "loss": 0.0014,
+ "num_tokens": 6301996.0,
+ "policy_entropy_avg": 8.135416666666666,
+ "reward": 0.43852299451828003,
+ "reward_std": 0.8475234508514404,
+ "rewards/cosine_scaled_reward/mean": 0.0016132990131154656,
+ "rewards/cosine_scaled_reward/std": 0.5085917711257935,
+ "rewards/format_reward/mean": 0.4375,
+ "rewards/format_reward/std": 0.5013279914855957,
+ "step": 44
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.8541666666666666,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3505.0,
+ "completions/mean_length": 3466.39599609375,
+ "completions/mean_terminated_length": 2777.571533203125,
+ "completions/min_length": 1678.0,
+ "completions/min_terminated_length": 1678.0,
+ "epoch": 0.05142857142857143,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.19519105553627014,
+ "kl": 0.0006815592447916666,
+ "learning_rate": 1.3890454406082956e-07,
+ "loss": 0.0045,
+ "num_tokens": 6477125.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.16751746833324432,
+ "reward_std": 0.5252600312232971,
+ "rewards/cosine_scaled_reward/mean": -0.030403709039092064,
+ "rewards/cosine_scaled_reward/std": 0.44781333208084106,
+ "rewards/format_reward/mean": 0.2291666716337204,
+ "rewards/format_reward/std": 0.4247443675994873,
+ "step": 45
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.7916666666666666,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3388.0,
+ "completions/mean_length": 3097.77099609375,
+ "completions/mean_terminated_length": 1250.0999755859375,
+ "completions/min_length": 605.0,
+ "completions/min_terminated_length": 605.0,
+ "epoch": 0.052571428571428575,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.2662387490272522,
+ "kl": 0.0007890065511067709,
+ "learning_rate": 1.2713832064634125e-07,
+ "loss": 0.006,
+ "num_tokens": 6634194.0,
+ "policy_entropy_avg": 8.125,
+ "reward": -0.2356199026107788,
+ "reward_std": 0.4806956648826599,
+ "rewards/cosine_scaled_reward/mean": -0.22256861627101898,
+ "rewards/cosine_scaled_reward/std": 0.2471582442522049,
+ "rewards/format_reward/mean": 0.2083333283662796,
+ "rewards/format_reward/std": 0.41041409969329834,
+ "step": 46
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.47916666666666663,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3500.0,
+ "completions/mean_length": 2685.25,
+ "completions/mean_terminated_length": 1858.39990234375,
+ "completions/min_length": 431.0,
+ "completions/min_terminated_length": 431.0,
+ "epoch": 0.053714285714285714,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.30878785252571106,
+ "kl": 0.0005480448404947916,
+ "learning_rate": 1.1743223682775649e-07,
+ "loss": 0.0002,
+ "num_tokens": 6770886.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.628765344619751,
+ "reward_std": 0.8911368250846863,
+ "rewards/cosine_scaled_reward/mean": 0.04512912034988403,
+ "rewards/cosine_scaled_reward/std": 0.5223999619483948,
+ "rewards/format_reward/mean": 0.5416666865348816,
+ "rewards/format_reward/std": 0.503533661365509,
+ "step": 47
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.6875,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 2532.0,
+ "completions/mean_length": 2819.6875,
+ "completions/mean_terminated_length": 1138.2000732421875,
+ "completions/min_length": 705.0,
+ "completions/min_terminated_length": 705.0,
+ "epoch": 0.054857142857142854,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.25418996810913086,
+ "kl": 0.0007025400797526041,
+ "learning_rate": 1.0983357966978745e-07,
+ "loss": 0.0033,
+ "num_tokens": 6914139.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.09342099726200104,
+ "reward_std": 0.7818130850791931,
+ "rewards/cosine_scaled_reward/mean": -0.11972144991159439,
+ "rewards/cosine_scaled_reward/std": 0.401507169008255,
+ "rewards/format_reward/mean": 0.3333333432674408,
+ "rewards/format_reward/std": 0.47639307379722595,
+ "step": 48
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.47916666666666663,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 3209.0,
+ "completions/mean_length": 2395.6875,
+ "completions/mean_terminated_length": 1302.43994140625,
+ "completions/min_length": 326.0,
+ "completions/min_terminated_length": 326.0,
+ "epoch": 0.056,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.2804883122444153,
+ "kl": 0.0006554921468098959,
+ "learning_rate": 1.0437936906629334e-07,
+ "loss": -0.0017,
+ "num_tokens": 7036680.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.4345873296260834,
+ "reward_std": 0.7855587005615234,
+ "rewards/cosine_scaled_reward/mean": -0.06286442279815674,
+ "rewards/cosine_scaled_reward/std": 0.4665209949016571,
+ "rewards/format_reward/mean": 0.5625,
+ "rewards/format_reward/std": 0.5013279914855957,
+ "step": 49
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.0,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.0,
+ "completions/clipped_ratio": 0.6666666666666667,
+ "completions/max_length": 3584.0,
+ "completions/max_terminated_length": 2765.0,
+ "completions/mean_length": 2816.8125,
+ "completions/mean_terminated_length": 1282.4375,
+ "completions/min_length": 370.0,
+ "completions/min_terminated_length": 370.0,
+ "epoch": 0.05714285714285714,
+ "frac_reward_zero_std": 0.0,
+ "grad_norm": 0.23478873074054718,
+ "kl": 0.0006268819173177084,
+ "learning_rate": 1.0109617738307911e-07,
+ "loss": -0.0009,
+ "num_tokens": 7179999.0,
+ "policy_entropy_avg": 8.125,
+ "reward": 0.23419660329818726,
+ "reward_std": 0.5556939840316772,
+ "rewards/cosine_scaled_reward/mean": -0.04897995665669441,
+ "rewards/cosine_scaled_reward/std": 0.39337849617004395,
+ "rewards/format_reward/mean": 0.3333333432674408,
+ "rewards/format_reward/std": 0.47639307379722595,
+ "step": 50
+ },
+ {
+ "epoch": 0.05714285714285714,
+ "step": 50,
+ "total_flos": 0.0,
+ "train_loss": 0.00044901110231876373,
+ "train_runtime": 4526.0548,
+ "train_samples_per_second": 0.53,
+ "train_steps_per_second": 0.011
+ }
+ ],
+ "logging_steps": 1,
+ "max_steps": 50,
+ "num_input_tokens_seen": 7179999,
+ "num_train_epochs": 1,
+ "save_steps": 50,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 4,
+ "trial_name": null,
+ "trial_params": null
+ }